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In  this  report  we  present  the  results  of  our  efforts 
towards  an  alqorithaic  treatment  of  several  problems  related  to 
normalization  in  relational  data  base  schemas.  Our  goal  was  to 
find  efficient  algorithms  for  the  solution  of  these  problems.  The 
problems  we  treat  are:  Synthesizing  a  third  normal  form  schema 
from  a  set  of  functional  dependencies,  deciding  the  membership 
problem  for  functional  dependencies,  synthesizing  a  Boyce-Codd 
normal  form  schema  from  a  set  of  functional  dependencies  a  nd 
finding  all  keys  of  a  given  relation.  The  first  problem  is 
treated  in  part  I,  the  second  problem  is  treated  in  part  II,  and 
the  last  two  problems  are  treated  in  part  III. 

The  notion  of  normalization  in  relational  data  bases  was 
presented  by  Codd  shortly  after  he  introduced  the  relational 
model.  Attributes  in  a  relational  schema  may  be  related  by 
various  relationships.  One  type  of  relationship  is  the  functional 
dependency.  It  was  found  that  if  attributes  in  a  relation  are 
related  by  functional  dependencies  in  certain  ways,  then  various 
problems  such  as  update  anomalies  may  arise.  Normalization  was 
offered  as  a  solution  to  this  problem.  As  presented  originally  it 
was  an  ad  hoc  procedure.  The  relational  schema  was  assumed  to  be 
given.  If  undesirable  functional  dependencies  existed  in  a 
relation  of  the  schema  then  the  relation  was  decomposed  into  two 
or  more  relations  in  which  the  problem  did  not  exist.  The  process 
was  repeated  until  no  problem  functional  dependencies  existed  in 
any  relation;  then  the  schema  was  said  to  be  in  third  normal 
form . 


The  functional  dependency  has  an  important  role  in 
definition  and  construction  of  normalized  relational  schemas, 
is  also  an  important  concept  in  the  modeling  of  the  semantic 
relational  data  bases.  It  has,  therefore,  been  suggested  that 
might  be  more  appropriate  to  start  the  construction  of  a  sc 
with  a  set  of  functional  dependencies.  The  relations  of 
schema  would  then  be  synthesized  from  the  given  functi 
dependencies.  The  success  of  this  approach  lies  in  the  exist 
of  a  suitable  synthesis  algorithm.  Such  an  algorithm  sh 
produce  a  schema  that  represents  the  information  embodied  in 
given  set  of  functional  dependencies  and  is  in  third  normal  f 
It  should  also  be  efficient  enough  to  be  used  in  real 
situations. 
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There  have  been  several  attempts  in  the  literatur 
describe  synthesis  algorithms.  However,  as  shown  in  [2],  none 
these  algorithms  was  correct.  It  was  observed  in  [2]  that 
looking  for  a  synthesis  algorithm  one  must  take  into  account 
algebraic  properties  of  functional  dependencies.  Also, 
synthesis  algorithm  based  on  these  properties  was  prese 
there.  However,  examples  were  given  of  cases  where 
algorithm  produced  non-normal ized  schemas.  These  examples  a 
to  dll  the  previously  published  algorithms  as  well. 
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In  the  first  part  of  this  report  we  present  a  synthesis 
algorithm  which,  to  our  knowledge,  is  the  first  to  satisfy  the 
goals  described  above.  The  synthesized  schema  represents  (in  a 
precisely  defined  way)  the  information  contents  of  the  original 
set  of  functional  dependencies,  it  is  provably  in  third  normal 
form  and  ,  furthermore,  it  contains  the  minimal  number  of 
relations  among  all  possible  schemas.  At  the  end  of  part  II  we 
present  an  efficient  implementation  of  this  algorithm.  Therefore, 
we  believe  that  this  algorithm  is  a  satisfying  solution  to  the 
synthesis  problem. 

A  possible  reason  why  all  the  previous  algorithms  have 
failed  is  that  the  problem  was  not  specified  precisely.  In 

particular,  one  need  to  know  what  is  the  exact  relationship 
between  the  original  set  of  functional  dependencies  and  the 
resulting  schema.  We  define  here  what  it  means  for  a  schema  to 
represent  the  information  content  of  a  set  of  functional 

dependencies.  In  Codd*s  papers  on  normalization  functional 

dependencies  were  considered  as  information  external  to  the 
schema.  For  us  this  posed  a  problem  since  we  do  not  know  of  any 
relational  system  in  which  functional  dependencies  can  be 
defined.  The  concept  of  representation  offers  a  solution  to  this 
orohlem;  functional  dependencies  are  now  part  of  the  schema  and 
need  not  he  defined  separately.  We  believe  that  this  notion  of 
representation  will  prove  to  be  useful  in  the  theory  of 

relational  schemas. 

A  basic  operation  which  is  used  many  times  in  the 
synthesis  algorithm  is  to  check  if  a  given  functional  dependency 
can  be  derived  from  other  functional  dependencies.  We  call  this 
operation  the  membership  test.  In  the  second  part  of  the  paper  we 
present  an  algorithm  for  the  membership  test  that  works  in  linear 
time.  rising  this  algorithm,  we  present  a  quadratic  time 
i  mple  m'^n  ta  tion  of  the  synthesis  algorithm.  We  doubt  if  a 
significant  improvement  on  this  time  bound  is  possible.  Also, 
using  a  known  relationship  between  functional  dependencies  and 
the  propositional  calculus  we  show  that  the  membership  algorithm 
can  be  used  to  decide  tau tologihood  for  a  restricted  class  of 
propositional  formulas  in  linear  time.  (The  last  result  is  not 
connected  to  the  rest  of  the  report.) 

The  results  in  the  first  two  parts  indicate  that  the 
algorithmic  approach  to  the  construction  of  relational  schemas  is 
quite  successful.  This  success  seems  to  be  based  on  the  use  of 
the  algebra  of  functional  dependencies.  Motivated  by  this 
success,  we  tried  to  analyze  two  other  problems  of  relational 
schemas  which  can  be  formulated  in  terms  of  functional 
dependencies.  These  problems  are  the  construction  of  Boyce- Codd 
ncrraal  form  schemas  and  the  computation  of  all  keys  of  a 
relation. 


The  normal  form  we  have  been  using  is  the  one  called  in 
the  literature  third  normal  form.  A  stronger  normal  form,  called 
Boyce-Codd  normal  form,  was  introduced  in  [8].  One  might  want  to 
know  if  our  synthesis  algorithm  would  produce  schemas  in  Boyce- 
Codd  normal  form.  We  show  that  it  is  not  so.  There  are  sets  of 


3 


functional  dependencies  that  cannot  be  represented  by  Boyce-Codd 
normal  form  schemas*  Also,  even  when  a  Boyce-Codd  normal  form 
schema  exists  for  a  set  of  functional  dependencies,  the  algorithm 
may  produce  a  non-Boyce-Codd  normal  form  schema.  Clearly,  the 
problem  of  deciding  if  a  given  schema  is  in  Boyce-Codd  normal 
form  is  algorithmically  solvable.  However,  we  prove  that  the 
problem  is  NP-complete.  These  results  suggest  that  a  feasible 
algorithm  (that  is,  an  algorithm  that  runs  in  polynomial  time) 
for  synthesi2:ing  Boyce-Codd  normal  form  schemas  does  not  exist. 

A  key  of  a  relation  is  a  minimal  subset  o  f  its 
attributes  such  that  all  other  attributes  of  the  relation  are 
dependent  on  it.  The  synthesis  algorithm  produces  for  each 
relation  one  or  more  keys.  We  present  examples  where  additional 
(non- synthesi2ed)  keys  exist  in  relations.  Several  alqori^-hros 
have  been  described  in  the  literature  for  the  computation  of  all 
keys  of  a  relation.  However,  they  are  all  slow  --  they  may  take 
exponential  time  in  the  worst  case.  We  prove  that  the  problem  of 
whether  additional  keys  exist  in  a  given  relation  is  NP-complete 
and  explain  why  this  result  implies  that  an  efficient  key  finding 
algorithm  probably  does  not  exist. 

The  report  is  organized  in  three  parts  which  are,  to 
some  extent,  independet.  It  is  possible  to  understand  the  results 
of  one  part  without  having  to  read  first  the  other  parts. 
However,  sections  1,2  and  1.4  contain  basic  material  which  is 
essential  to  the  understanding  of  all  three  parts  and  should, 
therefore,  be  read  before  part  II  and  Part  Til. 

The  work  reported  here  was  done  during  the  first  half  of 
1976.  The  first  part  is  an  extension  and  an  improvement  of  the 
results  in  chapters  2,5  of  [2],  It  will  be  published  shortly  in 
the  ACH  Transactions  On  Database  Systems, 
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This  method  assumes  the  existence  of  at  most  one 
functional  relationship  connecting  any  one  set  of  attributes  to 
another.  This  uniqueness  assumption,  which  is  required  by  all 
earlier  methods  as  well,  raises  difficult  semantic  questions  that 
will  be  discussed  in  de+ail. 
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I . 2  THE  H ELAIIONAL  MODEL 
1.2.1  Relations 


In  Codd's  relational  data  base  model,  mathemat 
relations  over  a  set  of  domains  are  used  to  describe  connect 
among  data  items  [6],  However,  not  ail  relations  serve  equ 
well  in  describing  these  connections  [7].  To  "judge  the  effi 
of  various  classes  of  relations,  we  begin  by  reviewing 
terminology  associated  with  the  relational  model. 

Conceptually,  a  relation  is  a  table  in  which  each  co 
corresponds  to  a  distinct  at  tribute  and  each  row  to  a  dist 
er.ti;t;y  (or  tu£le)  .  For  each  attribute  there  is  a  set  of  poss 
associated  values,  called  the  domain  of  that  attribute.  It 
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common  for  different  attributes  to  share  a  single  domain.  For 
example,  the  attributes  0nANTITY_IN_STOCK  and  SIZ E_OF_CLASS  both 
assume  values  from  the  domain  called  NON-NEGATIVE  INTEGERS. 

An  <entity,  attribute>  entry  in  a  relation  is  a  value 
associated  with  the  entity  chosen  from  the  domain  of  the 
attribute.  Formally,  a  relation  is  a  (finite)  subset  of  the 
cartesian  product  of  the  domains  associated  with  the  relation*3 
attributes. 

The  notation  for  describing  the  structure  of  a  data  base 
relation  includes  a  relation  name  (say  R)  and  a  set  of  attributes 
in  R  (say  (A1,  A2,  ...,  An}),  and  is  written:  R(A1,  A2,  ...,  An), 
e.g.,  see  fig.  la.  The  ordering  of  attributes  is  immaterial, 
since  attribute  names  are  distinct  within  a  relation.  (This  is 
one  reason  for  distinguishing  between  attributes  and  domains.) 
Notat iona lly ,  we  will  use  upper  case  letters  near  the  beginning 
of  the  alphabet  for  simple  (i.e. ,  singleton)  attributes  (e.g..  A, 
E,  C)  and  ones  near  the  end  of  the  alphabet  for  composite  (i.e., 
groups  of)  attributes  (e.g.,  X,  Y,  Z)  . 

The  set  of  entities  that  comprise  a  relation  normally 
changes  over  time,  as  entities  are  inserted,  deleted,  and 
modified.  This  is  one  important  way  that  data  base  relations 
differ  from  mathematical  relations. 

The  word  "relation"  is  often  used  in  the  literature  to 
describe  both  the  structure  of  the  relation  (e.g.,  R  (A  1 , . . , , An) )  , 
called  its  intention ,  which  is  static,  and  the  set  of  tuples  in 
the  relation,  called  its  extension.  In  the  sequel,  the  word 
" itiil  tefer  to  an  intention  unless  ex2licitly  stated 
i§®  •  That  is,  we  will  usually  be  referring  to  the 
structure  of  a  relation,  rather  than  the  set  of  tuples 
t  hemselves . 

T  .  2 .  2  Function^  Dependencies 

As  we  will  see  in  later  sections,  it  is  important  to 
consider  functional  relationships  when  choosing  how  to  group 
attributes  into  relations.  Functional  relationships  among  data 
base  attributes  are  formalized  in  the  concept  of  functional 
dependency. 

Let  A  and  B  be  attributes,  let  DOM  (A)  be  the  domain  of  A 
and  DOM (3)  be  the  domain  of  B,  and  let  f  be  a  time-varying 
function  such  that  f : DOM  (A) ->DOM (B) .  f  is  not  a  function  in  the 
precise  mathematical  sense,  because  we  allow  the  extension  of  f 
to  vary  over  time  in  the  same  sense  that  we  allow  extensions  of 
data  base  relations  to  change  over  time.  That  is,  if  f  is  thought 
of  as  a  set  of  ordered  pairs  {(a,b)  )  a  e  DOM (A)  and  b  €  DOM(B)}. 

then  at  every  point  in  time  for  a  given  value  of  a  e  DOM  (A)  ther-^ 
will  be  at  most  one  value  of  b  €  DOM (B) ,  To  distinguish  f  from 
mathematical  function,  we  call  f  a  functional  dependency  (abbr. 
FD).  For  notational  convenience,  we  generally  leave  out  the 
"DOM"s  and  write  f:A->B.  If  there  is  an  FD  f:A->B,  then  B  is  said 
to  be  functionally  dependent  on  A. 
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Figure  1 

Funct iopal  Dependencies 

(a)  An  example  of  a  relational  schema 

EMPLOYEE  (EMP£r  NAME,  DEPT#) 
DEPAPTMENT  (DEPT#,  MGP#) 

INVENTORY  ( STOCK DEPT # ,  QTY) 


(b)  Functional  dependencies  for  the  above  schema 

EMP#  ->  NAME 
EMP#  ->  DEPT# 

DEPT#  ->  MGP# 

STOCK#,  DEPT#  ->  QTY 


7 


The  above  definitions  are  generalized  in  the  obvious  way 
for  functional  dependencies  over  compound  attributes.  If  X  = 
fA1,.,.,An}  and  Y  =  are  sets  of  attributes,  then  f:X 
->  Y  means  f : DOM (A  1)  x. . . x DOM (An)  ->  DOM (B1 ) x. . . xDCM  (Bm) .  He  will 
normally  leave  off  the  set  notation  in  FDs  and  write 
f :  {A1 , . . . , An}  ->  simply  as  f:A1,...,An  -> 

As  an  example,  the  functional  dependencies  for  attributes  in 
figure  la  are  given  in  figure  1b. 

In  this  paper  we  will  assume  that  for  any  two  sets  of 
attributes  X  and  Y,  there  is  at  most  one  FD  X  ->  Y.  At  tributes 
may  need  to  be  renamed  to  guarantee  this  assumption.  This 
restriction  is  an  important  one,  and  will  be  discussed  in  detail 
in  section  4,2.  He  will  also  show  later  that  nonfunctional 
relationships  need  not  satisfy  this  uniqueness  assumption. 

Given  this  assumption,  if  f:A->B,  then  we  will 
frequently  write  A->B  as  an  abbreviation.  The  notation  A/>B 
means  that  there  is  no  FD  A  ->  B  that  is  of  interest  (although  at 
a  given  point  in  time  in  some  relation,  it  may  be  true  that  no 
value  of  A  has  more  than  one  corresponding  value  of  P) . 

Let  R(A1,,.,,Aa)  be  a  relation  and  let  X  be  a  subset  of 
{A1,...,An},  X  is  called  a  key  of  R  if  every  attribute  in 
f.A1,..,,An}  that  is  not  in  X  is  functionally  dependent  upon  X  and 
if  no  subset  of  X  has  this  property.  Clearly,  a  relation  can  have 
many  keys.  A  su^erke^  of  P  is  any  set  of  attributes  in  R  that 
contains  a  key  of  E,  (Every  key  is  also  a  superkey.)  The  concept 
of  superkey  is  introduced  mainly  to  simplify  our  proofs  in  later 
sections. 

1.2.3  Operations  on  delations 

In  his  original  description  of  the  relational  model, 
Codd  introduced  the  relational  algebra  as  a  data  manipulation 
language  for  the  relational  data  base  model  [6}.  There  are  two 
basic  relational  algebraic  operations  that  will  be  of  some 
interest  to  us:  projection  and  join. 

The  projection  of  the  extension  of  a  relation,  R,  on  a 
subset  of  its  attributes,  X,  is  the  set  of  tuples  obtained  by 
excising  those  attributes  not  in  X.  If  two  tuples  are  now 
indistinguishable  because  they  only  differed  in  the  attributes 
that  were  eliminated,  then  they  are  "merged"  into  a  single  tuple. 
That  is,  the  result  of  the  projection  must  be  a  subset  of  the 
cartesian  product  of  the  domains  associated  with  the  attributes 
of  X. 


The  join  operation  is  used  to  make  a  connection  between 
attributes  that  appear  in  different  relations.  The  only  joi* 
operation  we  will  consider  here  is  the  natural  join  (i.e. 
equality  join).  The  natural  join  of  the  extension  of  a  relatio 
P  (A, E)  with  the  extension  of  relation  S  (B,C)  on  domain  B,  denoted 
?*S,  is  defined  to  be  {(a,b,c)  |  (a,b)  e  R  and  (b,c)  8  S}.  That 

is,  it  links  together  all  values  of  A  and  C  that  are  related  to 
common  B  values. 


a)  ■r^ 
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First,  relation-valued  domains  are  excluded  from 
s.  A  relation  is  in  first  normal  form  (abbr.  INF)  if 
omain  contains  simple  values.  There  are  two  main 

es  to  INF  [6].  First,  it  allows  the  data  base  to  be 
as  a  collection  of  tables  --  a  very  simple  and 

ndable  structure.  Second,  it  permits  the  definition  of  a 
class  of  primitive  operators  that  are  capable  of 

ting  relations  to  obtain  all  necessary  logical 

ons  among  attributes. 


The  second  and  third  normal  forms  are  introduced  to 
problems  caused  by  certain  functional  dependencies.  To 
these  problems,  consider  the  relation 

(STOCK#^  DIPT# ,  QTY,  MGR#)  obtained  by  joining  the 
NT  and  INVENTORY  relations  of  fig.  la  on  the  attribute 
The  insertion  of  the  first  inventory  item  for  a 
ar  DEPT#  into  the  extension  of  DEPT_INV  creates  a  new 
on  betweeen  that  DEPT#  and  its  MGR#.  The  deletion  of  the 
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last  inventory  item  for  a  particular  DEPT#  loses  the  connection 
between  that  DEPT#  and  its  HGRt.  These  side  effects,  called 
insert ion "deletion  anomalies,  only  occur  when  the  first  or  last 
tuple  of  a  DEPT#  is  inserted  or  deleted.  Also,  the  repetition  of 
the  connection  between  a  DEPT#  and  its  WGfi#  for  each  STOCK#  in 
the  DEPT#  can  lead  to  an  inconsistent  relation  if  arbitrary 
updates  on  individual  tuples  are  permitted.  These  problems  arise 
because  MGR#  is  functionally  dependenct  on  only  part  of  the  key 
STOCK#,  DEPT#.  To  eliminate  these  problems  from  DEPT_INV, 
DEPT_INV  must  be  put  into  second  normal  form. 

A  partial  dependency  occurs  when  an  attribute  is 
functionally  dependent  upon  a  subset  of  a  set  of  attributes.  Let 
f : A1 , . . . , An- >B  and  g; A1 , . . . ,Am- >E  be  functional  dependencies 
where  m<n.  The  attributes  Am+1,  Am<-2,  ...,  An  are  extraneous  in 
f,  since  A1,  ...,  Am  are  sufficient  to  functionally  determine  B. 
In  this  case,  B  is  said  to  be  partially  dependent  on  A1,...,An. 
If  for  a  given  f  there  is  no  g  with  the  above  property,  then  R  is 
fully  dependent  on  A1,...,An.  That  is,  there  are  no  extraneous 
attributes  in  the  domain  of  f. 

If  an  attribute  Ai  appears  in  any  key  of  R  then  it  is 
said  to  be  prime  in  P.  Otherwise,  it  is  nonprime  in  R.  A 
relation  is  in  second  normal  form  (abbr.  2NF)  if  it  is  in  INF  and 
each  of  its  nonprime  attributes  is  fully  dependent  upon  every  key 
[7].  The  relation  DEPT_INV  (STOCK#^DEPT# ,  QTY,  MGR#)  is  not  in 
2NF,  because  MGR#  is  a  nonprime  attribute  and  is  partially 
dependent  on  the  key  STOCK#,  DEPT#.  The  relations  DEPT  and 
INVENTORY  in  fig.  la  are  in  2NF. 

Consider  now  the  relation  EMP_DEPT  (EMP# ,  NAME,  DEPT#, 
MGR#)  obtained  by  joining  the  EMPLOYEE  and  DEPARTMENT  relations 
of  fig.  1a  on  DEPT#.  Although  EMP_DEPT  is  in  2NF,  it  displays 
the  same  problems  as  DEPT_INV.  Inserting  or  deleting  the  first 
EMP#  in  a  particular  DEPT#  creates  an  anomaly,  for  a  DEPT#-MGR# 
connection  is  created  or  destroyed  in  the  process.  The 
repetition  of  the  DEPT#-MGB#  connection  for  each  EMP#  in  the 
DFPT#  creates  the  same  consistency  problem  as  in  DEPT_INV.  In 
this  case,  the  problems  arise  because  MGR#  is  functionally 
dependent  on  the  key  EMP#  via  the  attribute  DEPT#.  To  eliminate 
the  problems,  the  relations  EMP_DEPT  must  be  put  into  third 
normal  form. 

Let  P(A1,...,An)  be  a  relation.  An  attribute,  Ai ,  is 
t dependent  upon  a  set  of  attributes,  X,  if  there 
exists  a  set  of  attributes,  Y  c  (A1,...,An},  such  that  X  ->  Y,  Y 
/>  X,  and  Y  ->  Ai  with  Ai  not  an  element  of  X  or  Y. 

A  relation  is  in  third  normal  form  (abbr.  3NF)  if  none 
of  its  nonprime  attributes  are  transitively  dependent  upon  any 
key  [7].  A  3NF  relation  is  also  in  2NF;  for  if  an  attribute  Ai 
is  partially  dependent  on  a  key  X,  then  Ai  is  transitively 
dependent  on  X,  since  X  ->  X*,  X’  />  X,  and  X'  ->  Ai  for  some  X* 
c  X.  The  relation  EMP_DEPT  is  not  in  3NF,  because  MGR#  is 
nonprime  and  is  transitively  dependent  upon  the  key  EMP#.  All  of 
the  relations  in  figure  la  are  in  3NF  (and  hence  2NF) ,  given  the 
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FDs  cf  figure  1b.  Further  examples  of  normal  form  relations  and 
surrounding  problems  can  be  found  in  [6,7,9], 


I. 3  SYNTHESIZING  A  RELATIONAL  SCHEKA 

1.3,1  The  Sjrn  thesis  Problem  an^  Nonfunctiona  1  Relationshi  ps 
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The  approach  of  building  a  relational  schema  from  FDs 
r^sts  entirely  on  the  ability  to  represent  all  data  relationships 
as  FDs.  Clearly,  though,  not  every  logical  connection  in  the 
world  is  functional.  Nevertheless,  we  claim  that  all  connections 
among  attributes  in  a  data  base  description  can  be  represented  by 
FDs.  As  long  as  connections  are  functional  there  is  of  course  no 
problem.  Nonfunctional  connections  require  special  treatment. 


A  nonfunctional 
■^tributes  A 1,  A  2,,..,  An  will 
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connection,  f,  among  a  group  of 
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ther  FD.  Each  FD  representing  a 
as  its  own  private  ©  attribute.  The 
these  6  attributes  is  the  set  {0,1}. 
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among  A1,,.,,An.  For  example,  a 
tween  a  DRIVER  and  AUTOHOBILE,  where 
by  more  than  one  DRIVER  and  each 
ne  AUTOMOBILE,  is  represented  by  the 


Notice  that  more  than  one  nonfunctional  relationship  can 
exist  among  a  set  of  attributes  without  violating  the  uniqueness 
assumption  of  FDs.  For  example,  we  can  have  a  second  relationship 
between  DRIVER  and  AUTOMOBILE  that  indicates  ownership; 
DRIVER , AUTOMOBILE  ->  62,  By  assigning  a  unique  ©  to  each 
nonfunctional  relationship,  the  uniqueness  assumption  for  FDs  is 
retained . 


This  6  notation  allows  us  to  represent  all  nonfunctional 
relationships  as  FDs.  The  synthesis  algorithm  will  produce 
approximately  one  relation  for  each  of  these  nonfunctional 
relationships.  In  section  5,  we  will  show  precisely  how  each  of 
these  "nonfunctional  FDs"  becomes  embodied  in  the  synthesized 
relational  schema. 


1.3.2  Formalizing  the  Synthesis  Problem 

Although  the  motivation  for  the  synthesis  problem  is 
from  data  base  management,  one  can  formalize  the  problem  in 
purely  symbolic  terms  as  follows.  We  are  given  a  set,  S,  of 
symbols  (i.e.,  attributes)  and  a  set,  F,  of  mappings  of  sets  of 
symbols  into  symbols  (i.e,,  FPs) .  The  problem  is  to  find  a 
collection  C  =  fC1,...,Cm}  of  subsets  of  S  (i.e.,  a  collection  oi 
relations)  and  for  each  Ci  a  collection  of  subsets  of  Ci  (i.e.,  a 
collection  of  keys  for  each  relation)  satisfying  three 
properties:  First,  F  is  ''embodied"  in  C  (i.e.,  the  relations 
embody  the  qiven  FDs) ,  Second,  each  Ci  can  have  no  transitive 
dependencies  (i.e.,  it  is  in  3NF) .  Third,  the  cardinality  of  C 
is  minimal. 

This  treatment  of  the  problem  is  still  somewhat  fuzzy, 
since  we  have  not  yet  discussed  the  algebraic  rules  for  composing 
FDs.  To  motivate  the  need  for  these  rules,  we  present  a  simple 
synthesis  algorithm.  This  algorithm  ignores  algebraic 
considerations  and  will  be  shown  to  be  inadeguate. 

1.3.3  A  Simpl^e  Synthesis  Procedure 

One  (overly)  simple  way  to  obtain  relations  from  a  given 
set  of  FDs  is  to  group  together  all  a  ttri butes  that  are 
functionally  dependent  upon  the  same  set  of  attributes.  This 
suggests  the  following  procedure.  First,  partition  the  given  set 
of  FDs  into  groups  such  that  all  of  the  FDs  in  each  group  have 
identical  left  sides.  Then,  for  each  group  construct  a  relation 
consisting  of  all  the  attributes  appearing  in  that  group.  The 
If^ft  side  of  the  FD  in  each  group  is  a  key  of  the  corresponding 
relation.  For  example,  see  figure  2, 

Several  undesirable  properties  of  this  method  can  be 
seen  in  the  example.  First,  the  synthesized  relations  are  not  in 
3NF,  For  example,  in  relation  FI  of  fig.  2,  C  is  transitively 
dependent  on  the  key  A.  In  F4,  P  is  partially  dependent  on  the 
key  AE,  The  unnormalized  relations  are  due  to  redundancnss  in 
the  given  set.  of  FDs.  We  will  see  later  that  f2  is  redundant  and 
that  F  is  an  extraneous  attribute  in  f6. 

Second,  the  left  sides  of  FDs  are  not  necessarily  keys 
of  the  relations,  although  they  are  always  superkeys.  In  FU,  AEF 
is  a  superkey  but  not  a  key,  since  B  is  extraneous. 

Third,  this  procedure  synthesizes  too  many  relations. 
Since  fU  and  f5  are  inverses  of  each  other,  the  relation  33  is 
extraneous.  This  results  from  a  failure  of  the  procedure  when 
constructing  P2  to  recognize  D  as  a  second  key  by  virtue  of  f5, 
rather  than  to  put  f5  into  a  separate  relation. 

To  solve  these  problems,  we  must  first  formalize  the 
concept  of  a  redundant  FD.  We  will  then  return  to  a  pres-^nta tion 
of  a  synthesis  algorithm  that  overcomes  the  above  difficulties. 
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Figure  2 


^  Schema  from  FDs 


Wg  are  given  the  following  set  of  FDs: 

f1:  A  ->  B 
f2:  A  ->  C 
f3:  B  ->  C 
f4:  B  ->  D 
f5:  D  ->  B 
ff:  ABF  ->  F 

We  group  the  FDs  according  to  common  left  hand  sides,  obtaining 
three  groups; 


gi  =  ffi,f2) 

g2  =  (f3,fU} 
g3  =  {f5} 
g4  =  {f6} 

For  each  group  we  construct  a  relation  consisting  of  all  of 
the  attributes  in  the  group: 

B  1  (A,B,C) 

R2  (B,C,D) 

B  3  (D,B) 

B4  (A^E,B,F) 

where  the  underscored  attributes  are  keys. 
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1.4  THE  ALGEBRA  AND  SEMANTICS  OF  FUNCTIONAL  DEPENDENCIES 
1.4.1  Armstrong's  Axiomat ization  of  Functional  Dependencies 

The  complete  a xioma tization  of  FDs  given  by  Armstrong 
fl]  provides  a  theoretical  background  to  the  study  of  the  algebra 
of  FDs  that  is  treated  in  later  sections,  Armstrong  shows  that  if 
a  given  set  of  FDs  exist  in  (the  extension  of)  a  relation,  then 
any  FDs  that  can  be  derived  from  the  given  set  using  the  axioms 
must  also  exist.  Armstrong  presents  several  equivalent 
a xioma tizations  of  FDs.  The  one  we  will  use  is  based  on 
properties  of  FDs  proved  by  Delobel  and  Casey  [  10].  They  are: 

A1.  (ref lexi vi ty )  X  ->  X 

A2.  (augmentation)  if  X  ->  Z  then  X+Y  ->  Z 

A.3.  (pseudo transitivity)  if  X  ->  Y  and  Y+Z  ->  w  then 
X+Z  ->  W 

where  the  symbol  means  "set  union"  (of  not  necessarily 

disjoint  sets). 

If  R(A,B)  is  a  relation,  then  axiom  A1  can  be  applied 
with  X=fA,B}  to  show  that  A,B  ->  A,B  or  with  X=  (A)  to  show 
that  A  ->  A. 

The  meaning  of  A2  is  simply  that  if  f:X  ->  Z,  then  one 

can  create  another  FD,  g,  where  the  domain  of  g  includes  X  as 

well  as  some  other  extraneous  attributes,  Y,  whose  values  have  no 
effect  on  the  value  of  Z  selected  by  g.  So,  knowing  that  ->  A, 

we  can  obtain  A,B  ->  A  (i.e.,  X=  (A)  ,  Z=  (A) ,  and  Y- {3} )  . 

Axiom  A3  is  a  substitution  rule  for  composing  FDs.  Let 

f:X  ->  Y  and  g:Y+Z  ->  W.  The  axiom  claims  that  there  is  an  h:X+Z 

->  W.  To  see  where  h  comes  from,  consider  the  application  of  h 

to  a  given  x  €  DOM  (X)  and  z  e  DOM  (Z)  in  two  steps.  First,  f  is 

applied  to  x,  yielding  a  unique  y  €  DOM(Y) ,  Second,  q  is  applied 
to  y  and  z,  yielding  a  unique  w  e  DOM (W) ,  and  thereby  completing 
the  application  of  h.  Symbolically,  we  can  say  h(x,z)  is  defined 
to  be  g(f(x),z).  Also,  note  that  in  the  statement  of  axiom  A3  if 
Z  is  the  null  set,  then  pseudotransitivity  becomes  simple 
transitivity. 

Let  G  be  a  set  of  FDs.  The  closure  of  G,  denoted  G+ ,  is 
defined  to  be  the  smallest  superset  of  G  that  is  closed  under  A1, 
A2,  and  A3.  For  a  given  G,  G-*-  can  be  shown  to  be  unique.  By 

Armstrong's  theory  we  know  that  if  G  is  a  given  set  of  FDs  for  a 

relation  R,  then  each  FD  in  G-*-  also  exists  in  R. 

An  FD  g  e  G  is  redundant  in  G  if  G-*-  =  (G-{g})+.  H  is  a 

n onred undant  covering  of  a  given  set  of  FDs,  G,  if  G*-=H-'-  and  H 

ccntains  no  redundant  FDs. 
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An  important  property  of  FDs  that  will  be  used  later  to 
prove  a  number  of  theorems  is  stated  in  lemma  1.  It  is  based  on 
the  concept  of  a  ••derivation”,  which  we  will  informally  consider 
to  be  a  series  of  applications  of  Armstrong’s  axioms  on  a  given 
set  of  FDs.  A  formal  development  appears  in  part  II, 

Lemma  J[;  Let  G  be  a  set  of  FDs,  and  let  g:X  ->  Y  be  an  FD  in  G. 
If  h:V  ->  W  is  in  G-*-  and  g  is  used  for  some  derivation  of  h  from 
G,  then  V  ->  X  is  in  G+. 

£.1221  give  here  an  intuitive  argument  usina  an  informal 
notion  of  a  derivation.  A  formal  proof  using  the  ’’derivation 
tree”  model  of  derivations  is  given  in  section  II . 1 . 2 

We  introduce  the  notation  U  ==>  Z  to  mean  that  the  FD  u 
->  Z  can  be  derived  by  an  application  of  one  of  Armstrong’s 
axioms  on  a  given  set  of  FDs.  The  notation  U  =*=>  Z  means  that  U 
->  Z  is  derivable  using  several  applications  of  the  axioms.  Now, 
the  lemma  states  there  is  a  derivation  V  =♦=>  W  using  g.  That 
is,  there  is  a  derivation  V  =♦=>  ZX  -=>  ZY  =♦=>  w  for  some 
(possible  empty)  set  of  attributes  Z  (the  step  ZX  ==>  ZY  is  the 
step  that  uses  g) ,  But  V  =♦=>  ZX  implies  V  ->  ZX,  which  implies 
V  ->  X,  thereby  proving  the  lemma,  a 

1.4,2  Uniaueness  and  the  Semantics  of  FDs 


The  treatment  of  FDs  in  this  paper  is  a  strictly 
syntactic  one  based  on  Armstrong’s  axioms.  To  use  this  approach, 
w-^  must  make  the  following  assumption  of  uniqueness:  for  a  given 
set  of  FDs  G  and  an  FD  X  ->  Y,  either  X  ->  Y  is  not  in  G*-  or 
there  exists  a  unique  FD  X  ->  Y  in  G’-.  That  is,  if  there  are  +wo 
FDs  on  the  same  set  of  attributes,  then  they  are  the  same  FD;  if 
f:X  ->  Y  and  g:X  ->  Y  then  f  is  identical  to  q.  Thus,  the  set 
of  FDs  that  are  accepted  as  input  to  the  synthesis  algorithm  is 
assumed  to  satisfy  not  only  Armstrong’s  axioms,  but  also  the 
uniqueness  assumption.  (Both  of  these  assumptions  are  also 
required  for  all  previous  syntactic  approaches  to  3NF  (e.g., 
[10,15,17]).)  That  this  uniqueness  assumption  is  quite  strong  can 
be  seen  from  several  examples. 

Let  f1:DEPT#  ->  MGR#  and  f 2 : MGB # , FLOOR  -> 
NUMBFP_o F_EMPLOYFES.  One  interpretation  of  f1  and  f2  is  that  f1 
determines  the  manager  of  each  department  and  f2  determines  the 
number  of  employees  w or  king  for  a  pa  rticular  manager  on  a 
particular  floor.  By  applying  pseudotransiti vity  to  f1  and  f2  we 
obtain  f  3  :  DEPT#,  FLOOR  ->  NnMBER_0F_Ef1PL0YEES ,  which  determines 
the  number  of  employees  of  the  manager  of  a  particular  department 
on  a  particular  floor ,  If  a  manager  can  manage  more  than  one 
department,  then  f3  is  not  the  same  as  the  syntactically 
identical  FD  g1 : DEPT#, FLOOF  ->  NUMBER_OF_EM PLOY  EES ,  which 
determines  the  number  of  employees  of  a  particular  depar tment  on 
a  particular  floor.  To  make  g1  distinct  from  f3,  one  has  to 
change  an  attribute  name  to  make  the  FDs  syntact ically  distinct. 
Fcr  example,  one  could  change  f2  and  g  such  that  f 2 : MGP# , FLOG R  -> 
NUHBEH_OF_EMPLOYEFS_OF_HANAGEF  and  g 1 : DEPT# , FLOO P  -> 
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NfJ«BEF_OF_EMPLOYEES_OF_DEPT.  Now,  g1  is  distinct  from  the 
composition  of  f1  and  f2. 

As  a  second  example,  let  f4;EMP#  ->  MGR#  and  f5:nGR#  -> 
EMP#.  It  must  be  the  case,  here,  that  fU  is  the  inverse  of  f5. 
For  if  we  compose  fU  and  f5,  we  obtain  g2:EMP#  ->  EKP#.  Since 
th=»re  is  only  one  FD  connecting  EMP#  to  EMP#  (by  our  assumption), 
and  since  by  Armstrong's  axioms  the  identity  function  must  exist, 
then  g2  must  be  the  identity  map.  This  implies  fU  =  f5“'.  If  wo 
take  the  interpretation  that  fU  maps  an  employee  into  his  manager 
and  f5  maps  a  manager's  MGR#  into  his  corresponding  EMP#,  then  of 
course  f4  /  f5-i.  So  to  take  this  interpretation,  one  must  make 
f4  and  f5  syntactically  distinct  (e.g.,  f5:MGR#  ->  EMP#_OF_MGF) . 

As  a  third  example,  let  f6: STOCK#  ->  STORE#  and 
f 7 : STOCK# , STORE#  ->  QTY,  Since  the  composition  of  ft  and  f7  is 
q3: STOCK#  ->  QTY,  i+  must  be  (by  our  assumption)  that  the 
attribute  STORE#  in  f7  is  not  needed.  Rut  suppose  f6  maps  a 
STOCK#  into  the  STORE#  of  the  store  that  is  in  charge  of  ordering 
that  item  and  f7  maps  the  STOCK#  of  an  item  and  the  STORE#  of  the 
store  in  which  it  is  being  sold  into  the  quantity  on  hand.  In 
this  case,  g3  does  not  imply  that  STORE#  is  extraneous  in  f7.  To 
prevent  this  syntactic  inference  from  taking  place,  we  must 
change  an  attribute  name  (e.g.,  f6:ST0CK#  ->  ORDERING_STORE#)  . 

In  each  of  these  examples,  a  syntactic  inference  was 
either  erroneous  or  misleading.  In  each  case,  we  solved  the 
problem  by  renaming  an  attribute  to  distinguish  it  from  another 
attribute.  This  renaming  essentially  moves  some  semantic 
knowledge  that  we  have  about  an  FD  onto  the  syntactic  level, 
where  it  can  bo  used  by  the  algebra  of  FDs. 

Specifying  a  set  of  FDs  that  can  lead  to  no  invalid 
syntactic  inferences  is  clearly  a  difficult  problem.  For  no 
syntactic  check  based  only  on  the  algebra  of  FDs  can  determine 
whether  a  given  set  of  FDs  satisfies  the  uniqueness  assumption. 
Ye'-.,  if  we  are  to  make  use  of  a  formal  algebra  of  FDs,  we  must 
make  the  assumption  that  all  syntactic  inferences  are  valid.  If 
we  had  an  automated  semantic  analyzer  that  could  judge  the 
validi'-y  of  each  syntactic  inference,  then  we  could  use  it  as  a 
sieve  to  toss  out  invalid  inferences.  Unfortunately,  such  a 
semantic  analyzer  is  well  beyond  the  state  of  the  art.  So,  we 
will  add  to  our  assumption  of  the  validity  of  syntactic 
inferences  the  proviso  that  all  syntactic  inferences  arc  (or  at 
least  can  be)  checked  for  semantic  validity.  If  an  inference  is 
invalid,  it  can  either  result  in  renaming  of  some  attributes  or 
be  simply  rejected. 

Third  normal  form  is  a  strictly  syntactic  property  that 
is  governed  by  the  algebra  of  FDs.  In  this  work  we  give  a 
complete  account  of  mapping  from  FDs  into  a  3NF  schema,  giver 
that  Armstrong's  axioms  and  the  uniqueness  assumption  are 
accepted.  Given  Armstrong's  completeness  proof,  we  believe  those 
assumptions  to  be  quite  reasonable  in  modelling  relational  data 
bases.  k'e  are  not  attacking  the  problem  of  how  to  judge  the 
semantic  validity  of  syntactic  inferences.  Semantic  problems  of 
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this  type  are  not  well  understood  and  seem  to  be  more 
than  the  syntactic  problem  of  determining  3NF.  Their 
remains  a  matter  for  further  research. 


difficult 

solution 


1.5  A  WORE  SOPHISTICATED  SYNTHESIS  PROCEDURE 
1.5.1  A  Description  of  the  Algorithm 

The  simple  synthesis  procedure  o 
problems  because  the  rules  for  composing  FDs 
main  difficulty  is  that  redundant  FDs  t 
synthesized  schema  create  extra  attributes 
unnormalized  connections  among  attributes, 
nonredundant  covering  of  the  given  set  of  FDs 
problems  can  be  alleviated.  In  fig.  2, 
redundant  and  therefore  will  not  appear 
covering  of  the  given  FDs,  thereby  avoiding 
P  n 


f  section  3.3  led  to 
were  ignored.  The 
hat  filter  into  the 
and  contribute  to 
By  first  taking  a 
,  the  normalization 
for  example,  f2  is 
in  a  nonredundant 
the  3NF  violation  of 


Finding  a  nonredundant  covering  is  not  sufficient  to 
avoid  problem  FDs  such  as  f6  in  fig,  2.  This  further  problem  can 
be  eliminated  by  excising  extraneous  attributes  from  the  left 


sides  of  FDs.  An  attribute  Ai  is  extraneous  in 
g:A1,,..,Ap  ->  B,  if  AI , . . . , Ai- 1 , Ait  1 , . . . , Ap 

Eliminating  extraneous  attributes  helps  to 
dependencies  and  superkeys  that  are  not  keys, 
fig.  2. 


a  n  F  D  g  e  G , 
->  B  is  in  G-*- . 
avoid  partial 
such  as  in  RU  of 


If  two  relations  have  keys  that  are  functionally 
dependent  upon  each  other  (i.e.,  are  equivalent),  then  the  two 
r<=‘lations  can  be  merged  together.  This  can  be  accomplished  in  the 
synthesis  procedure  by  merging  together  two  groups  of  FDs  if 
their  left  sides  are  functionally  equivalent.  For  example,  g2  and 
g3  in  fig.  2  can  be  merged  into  a  single  group. 

Algorithm  1  (see  figure  3)  includes  the  above 
improvements.  In  the  seguel,  we  will  refer  to  Algorithm  1  with 
step  4  excised  as  Algorithm  la. 


A  linear  time  algorithm  for  testing  membership  in  the 
closure  of  a  set  of  FDs  is  presented  in  part  II. 
shown  there  that  using  this  procedure,  one 
Algorithm  1  with  a  time  bound  of  0(L2),  where  L  is 
the  string  encoding  the  given  set  of  FDs. 

1.5.2  C om_ple te n e ss  of  the  Synthesized  Schema 


It  is  also 
can  implement 
length  of 


the 


A  schema 

the  FDs  embodied  in 
Algorithm  1  synthesizes 


S  represents  a  set  of  FDs  F  if  the  closure  of 
the  relations  of  S  equals  F+.  To  show  that 
a  schema  that  represents  the  given  FDs, 
consider  a  set  of  FDs  F  that  is  given  as  input  to  Algorithm  1, 
Let  H  be  the  set  of  FDs  that  result  from  eliminating  extraneous 
attributes  and  redundant  FDs.  Clearly,  H+  still  equals  F'*-.  Let 
S  be  a  schema  synthesized  from 
FDs  embodied  in  S  and  H*  =  F+,  S 


F.  Since  H 
represents 


is  exactly  the  set  of 

F. 
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Figure  3 


S yn thesizir.q  a  Relational  ^hema  from  a  Set  of  FDs 


1.  (Flimina^e  extraneous  attributes)  Let  F  be  the  given  set  of 
FDs.  Eliminate  extraneous  attributes  from  the  left  side  of 
each  FD  in  F,  producing  the  set  G.  An  attribute  is 
extraneous  if  its  elimination  does  not  alter  the  closure  of 
the  set  of  FDs. 

2.  (Find  covering)  Find  a  nonredundant  covering,  H,  of  G, 

3.  (Partition)  Partition  H  into  groups  such  that  all  of  the  FDs 
in  each  group  have  identical  left  sides, 

4.  (Merge  eguivalent  keys)  For  each  pair  of  groups,  say  HI  and 
H2,  with  left  sides  X  and  Y  respectively,  merge  HI  and  H2 
together  if  there  is  a  bijection  X  <->  Y  in  H+, 

5.  (Construct  relations)  For  each  group,  construct  a  relation 
consisting  of  all  the  attributes  appearing  in  that  group. 
Each  set  of  attributes  that  appears  on  the  left  side  of  any 
FD  in  the  group  is  a  key  of  the  relation,  (Step  1  guarantees 
that  no  such  set  contains  any  extra  attributes.)  All  keys 
found  by  this  algorithm  will  be  called  synthesized.  The  set 
of  constructed  relations  constitutes  a  schema  for  the  given 
set  of  FDs. 
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We  would  like  to  be  certain  that  the  extension  of  every 
FD  in  the  given  set  F  can  be  retrieved  from  the  extension  of  the 
synthesized  schema  S  using  relational  algebra.  We  will  argue 
that  this  follows  from  the  fact  that  S  represents  F, 


Consider  some  f:X  ->  A  6  F.  We  begin  by  noting  that 
extraneous  attributes  in  X  can  be  ignored.  That  is,  if  the 
extension  of  an  FD  f*:X*  ->  A  where  X*  c  X  can  be  retrieved  from 
the  extension  of  S,  then  since  f  can  be  obtained  from  f’  simply 
by  augmentation,  we  can  treat  f’  to  be  the  same  FD  as  f.  Now, 
since  S  represents  F,  there  is  a  derivation  for  f  based  on  the 
set  of  FDs,  H,  that  are  embodied  in  S.  In  part  II,  we  show  that 
if  f*  has  no  extraneous  attributes,  then  it  can  be  derived  from  H 
using  only  the  pseudotr ansi vi t y  axiom.  Since  an  application  of 
psuedotransitivity  corresponds  exactly  to  a  join  in  relational 
algebra,  the  derivation  for  f*  from  H  can  be  simulated  by  a 
sequence  of  joins  on  the  extension  of  the  relations  of  S.  In 
this  way,  the  extension  of  every  f  e  F  can  he  retrieved  from  the 
extension  of  S  using  relational  algebra.  That  is,  our  notion  of 
•representation*  satisfies  the  intuition  that  all  relationships 
specified  in  the  given  set  of  FDs  are  actually  retrievable  from 
the  extension  of  the  synthesized  schema. 

T.5.3  Nonfunctional  gelation  ships 

We  introduced  a  special  notation  for  representing 
nonfunctional  relationships  in  our  input  FDs.  We  must  now  make 
sure  that  these  FDs  behave  in  the  expected  way. 

If  X  ->  9  is  in  the  set  of  FDs  given  to  Algorithm  1, 
then  either  X  ->  9  or  Y  ->  9,  where  Y  <->  X,  appears  in  the 
schema  synthesized  by  the  algorithm.  This  is  a  consequence  of 
the  following  lemma,  which  is  proved  in  part  II.  Thus,  the 
nonfunctional  relationships  appear  in  the  schema  in  nearly  the 
same  form  that  they  are  specified  in  the  given  set  of  FDs, 

Lemma  2;  If  X  ->  9  is  in  a  set  of  FDs  G,  then  for  any 
nonredundant  covering  H  of  G,  either  X  ->  9  is  in  H  or  Y  ->  9  is 
in  H,  where  X  ->  Y  and  Y  ->  X.  n 


By  the  above  lemma,  the  9  attributes,  which  were 
invented  to  permit  the  representation  of  nonfunctional 
relationships,  always  appear  in  the  synthesized  schema.  How  are 
they  interpreted?  To  see  this,  consider  the  following  example. 
Suppose  two  nonfunctional  relationships  were  specified  in  the 
given  set  of  FDs:  f1:AB  ->  91  and  f2:AB  ->  92.  (Notice  again 
that  the  uniqueness  assumption  of  FDs  does  not  force  uniqueness 
of  nonfunctional  relationships  between  A  and  E.)  Step  4  of 
Algorithm  1  merges  these  two  FDs  into  a  single  group,  yielding  a 
relation  F.  (A^B, 9 1 , 92)  ,.  In  order  to  distinguish  whether  a  given 
pair  of  values  for  A  and  E  satisfy  f1,  f2,  or  both  f1  and  f2,  the 
91  and  92  attributes  must  be  retained.  For  example,  <a,  b,  0, 
1>  e  F  means  a,b  satisfies  f2  but  not  f1.  Notice  that  if  there 
is  only  one  nonfunctional  relationship  among  a  set  of  attributes, 
then  the  9  attribute  can  generally  be  dropped,  since  this  problem 
of  distinguishing  among  relationships  disappears.  For  example. 
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if  only  f1  were  present,  then  it  is  customary  only  to  include 
<a,b>  pairs  that  are  related  under  f 1 ;  a  tuple  <a,b,0>  would 
normally  not  be  included  in  the  extension.  Therefore,  in  this 
case,  the  0  attribute  would  be  dropped  altogether. 


1.6  THIFD  NOPMAL  FOEM  SCHEMAS 

1.6.1  Introduction 

In  this  section  we  show  under  what  conditions  various 
synthesized  relations  are  in  3NF.  We  begin  by  showing  tha+ 
Algorithm  la  (i.e,.  Algorithm.  1  without  step  U)  always  produces  a 
3NF  schema.  We  th^^n  examine  A.lgorithra  1.  A  property  of 
derivations  of  nonprime  attributes  is  introduced  and  shown  to  be 
a  sufficient  condition  for  Algorithm  1  to  produce  a  schema  in 
3NF.  Unfortunately,  there  are  cases  of  FDs  that  do  not  satisfy 
this  property  and  therefore  can  lead  to  relations  with  transitive 
dependencies.  One  such  example  is  presented  and  is  shown  to  oe  a 
counterexample  to  a  theorem  given  by  Delobel  and  Casey. 

1.6.2  A1  c[ o ri t h n  _1  a  Schemas 

To  prove  that  every  relation  synthesized  by  Algorithm  la 
is  in  3NF,  we  show  that  a  transitive  dependency  implies  the 
existence  of  a  redundant  FD  in  the  nonredundant  covering.  We  will 
use  lemma  1  (cf,  section  h,1)  to  show  the  existence  of  an  FD  that 
creates  the  contradiction.  Lemma  1  will  be  used  in  this  way  in 
all  succeeding  3NF  proofs. 

Theorem  _1:  Let  F.  (A1  ,  .  .  ,  ,  An  )  be  a  relation  synthesized  from  the 
set  of  FDs  F  using  Algorithm  1a.  Then  no  nonprime  attribute  of  F 
is  transitively  dependent  upon  any  key  of  R.  That  is,  F  is  in 
3  N  F . 

£122^  Suppose  Ai  is  nonprime  and  is  transitively  dependent  upon 
a  key,  K,  of  F.  (K  need  not  be  synthesized.)  That  is,  there  is  an 
X  c  fA1,...,An}  such  that  K  ->  X,  X  />  K,  and  X  ->  Ai  are  in  F+ , 
and  Ai  is  not  in  X. 

We  first  observe  that  Ai  is  transitively  dependent  u pon 
the  synthesized  key  of  F,  Let  Z  be  the  key  of  E  that  appears  on 
the  left  side  of  the  FDs  that  were  used  in  synthesizing  F. 
Clearly,  Z  ->  X  is  in  F+.  Furthermore,  X  />  Z.  For  if  X  ->  Z, 
thf^n  X  ->  Z  and  Z  ->  K  would  imply  X  ->  K,  contradicting  X  />  K 
in  th“  original  transitive  dependency.  Hence,  Z  ->  X,  X  />  Z,  and 
X  ->  Ai  is  also  a  transitive  dependency. 

Let  H  bo  the  nonredundant  covering  of  G  computed  in 
Algorithm  la.  We  will. now  show  that  Z  ->  Ai,  which  appears  in  H, 
is  redundant.  To  do  this,  it  is  sufficient  to  show  that  Z  ->  X 
and  X  ->  Ai  can  both  he  derived  from  H  -  [Z->Ai} . 

Since  the  only  FDs  used  in  synthesizing  R  are  of  the 
form  Z  ->  Aj,  it  must  he  that  Z  ->  A  is  in  H  for  all  A  S  X. 
Since  Ai  is  not  in  X,  Z  ->  A  is  in  H  -  (Z  ->  Ai}  for  all  A  €  X. 
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Suppose  there  is  a  derivation  for  X  ->  Ai  in  H  that  uses 
Z  ->  Ai.  Then,  by  lemma  1  we  have  X  ->  Z.  But  this  violates  X  /: 
Z  in  the  transitive  dependency.  So  X  ->  Ai  must  be  derivable 
without  using  Z  ->  Ai , 

Since  Z  ->  X  and  X  ->  Ai  can  both  be  derived  from  H  -  fZ 
->  Ai} ,  it  must  be  that  Z  ->  Ai  is  redundant  in  H,  contradicting 
the  fact  that  H  is  non redu ndan t.  But  this,  in  turn,  must  mean 
that  the  transitive  dependency  did  not  exist.  n 


The  above  theorem  was  first  presented  by  Hang  and 
Wedekind  [17],  however  their  proof  was  incorrect  [h].  In  the 
proof,  they  only  argued  that  the  transitive  dependency  was 
derivable  in  H,  not  H-{Z->Ai},  In  terms  of  the  above  proof,  they 
claimed  that  if  K  ->  X  and  X  ->  Ai  is  a  transitive  dependency, 
then  K  ->  Ai  is  derivable  by  pseud otran si ti vity.  This,  they 
asserted,  violates  the  fact  that  K  ->  Ai  is  in  a  nonredundant 
covering.  However,  the  latter  is  only  true  if  one  can  show  that 
both  K  ->  X  and  X  ->  Ai  are  derivable  from  the  closure  without 
using  K  ->  Ai.  For  example,  G  =  (A,  ->  B,  B  ->  A,  A  ->  C}  is  a  set 
of  FDs  where  A  ->  B  and  B  ->  C  are  in  the  closure  but  A  ->  C  is 
not  redundant,  because  B  ->  C  cannot  be  derived  from  G  without  A 
->  C.  In  any  case,  their  theorem  was  correct  as  stated,  and  the 
above  argument  fixes  their  proof,  using  lemma  1  and  the  important 
fact  that  X  y>  K  in  the  transitive  dependency. 

It  is  interesting  to  note  that  they  did  not  eliminate 
superkeys  in  their  version  of  Algorithm  la.  This  was  not  an 
error,  since  they  explicitly  assu  med  that  extraneous  attributes 
did  net  exist  on  the  left  sides  of  FDs.  However,  one  need  not 
make  this  general  assumption,  since  some  extraneous  attributes 
can  be  eliminated  algorithmically.  In  fact,  to  be  enti rely 
consistent  with  the  algebra  of  FDs,  one  must  eliminate  such 
extraneous  attributes.  Of  course,  not  all  such  extraneous 
attributes  can  be  eliminated  in  this  way;  many  semantic  errors 
must  remain  the  user's  responsibility  for  reasons  discussed  in 
section  4.2. 

One  might  expect  the  proof  of  Theorem  1  to  generalize  to 
schemas  synthesized  by  Algorithm  1.  Unfortunately,  this  is  not 
the  case.  A  schema  that  is  not  in  3NF  can  be  synthesized  by 
Algorithm  1,  as  shown  in  figure  4  (ii) ,  In  the  next  section  we 
will  add  a  further  precondition  that  is  sufficient  to  guarantee 
3NF  for  schemas  produced  by  Algorithm  1. 

1.6.3  A  Sufficient  Condition  for  3NF 

To  gaurantee  that  no  nonprime  attribute  is  transitively 
df^per.dent  upon  any  key  of  F,  we  will  use  the  following  property: 

An  attribute  A  is  said  to  satisfy  property  P  in  relation  R  if 
the  following  proposition  holds;  Let  H  be  the  nonredundant 
covering  produced  from  step  2  of  Algorithm  1.  If  K  ->  A  is 
in  H  and  K  ->  A  is  used  in  synthesizing  R,  then  for  any  prime 
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attribute  B  of  R,  the  FD  K  ->  B  can  be  derived  without  using 
K  ->  A  (i.e.  ,  can  be  derived  in  H  -  {K  ">  ■?'})• 

Property  P  is  strictly  a  syntactic  property  of  derivations  of 
FPs,  and  to  our  knowledge  has  no  semantic  interpretation  in  terms 
of  real  world  relationships.  It  is  the  weakest  property  we  know 
of  that  is  sufficient  to  guarantee  that  Algorithm  1  produces  3NF 
relations.  The  proof  that  property  P  is  sufficient  to  guarantee 
3NF  follows  the  same  lines  as  the  proof  of  theorem  1. 

Theorem  2;  let  P(A1,,.,,An)  be  one  of  the  relations  synthesized 
using  Algorithm  1  from  a  set  of  FDs,  F.  If  all  nonprime 
attributes  of  R  satisfy  property  P,  then  R  is  in  3NF. 

Proof  Let  Ai  be  a  nonprime  attribute  of  P  that  is  transitively 
dependent  upon  some  key  Y  of  P,  That  is,  there  is  a  Z  c 
(A1,.,.,An}  such  that  Y  ->  Z,  Z  />  Y,  and  Z  ->  Ai  with  Ai  not  in 
Z.  Let  H  be  the  nonredundant  covering  computed  in  Alg .  1  and  let 
K  be  a  key  such  that  h:K  ->  Ai  is  in  H,  That  is,  h  is  an  FD  that 
brought  Ai  into  R  by  Alg.  1. 

Since  K  is  a  key,  K  ->  Z.  Furthermore,  Z  />  K,  For  if 
Z  ->  K,  then  Z  ->  K  and  K  ->  Y  implies  Z  ->  Y,  a  contradiction. 
Sc,  we  have  a  new  transitive  dependency:  K  ->  Z,  Z  />  K,  and  Z  -> 
Ai.  We  want  to  show  that  K  ->  Z  and  Z  ->  Ai  are  in  (H  -  [K  -> 
Ai})+  to  establish  a  contradiction  that  H  is  redundant. 

Let  Z  =  [Bl,.,,,Em}.  We  distinguish  two  cases.  If  B-j 
is  orime,  then  property  P  guarantees  that  K  ->  Bi  is  derivable 
without  using  K  ->  Ai .  if  Pj  is  not  prime,  then  there  is  an  FD 
K*  ->  Bj  that  brought  Pj  into  R.  Since  K  ->  K'  is  derivable  (by 
property  P)  from  H  -  [K  ->  Ai}  and  K*  ->  Bj  is  in  H  -(K  ->  Ai} , 
we  obtain  that  K  ->  Bj  is  derivable  without  using  K  ->  Ai. 
Hence,  K  ->  Z  is  in  ( H- fK->Ai} ) ♦ . 

Now,  assume  Z  ->  Ai  uses  K  ->  Ai  in  its  derivation. 
Then  by  lemma  1,  Z  ->  K,  contradicting  the  transitive  dependency. 
Hence,  Z  ->  Ai  is  in  (H  -  {K  ->  Ai})-*-. 

The  FDs  K  ->  Z  and  Z  ->  Ai  are  in  (H  -  (K  ->  Ai})«-, 
establishing  that  H  is  redundant,  a  contradiction.  Hence  the 
transitive  dependency  could  not  have  existed,  a 


The  need  for  property  P  arises  from  the  merging  of 
equivalent  keys  in  s-^ep  U  of  Algorithm  1,  Suppose  K1  and  K2  are 
merged  in  step  4  because  K1  <->  K2,  and  suppose  K1  ->  7,  Z  />  K1, 
Z  ->  A  is  a  transitive  dependency  in  the  synthesized  relation. 
This  transitive  dependency  would  not  exist  if  K1  and  K2  were  the 
keys  of  two  separate  relations,  as  would  be  the  case  using 
Algorithm  la.  One  way  the  transitive  dependency  can  arise  is 
that  there  is  a  Zi  e  Z,  with  K2  ->  Zi  (and  K1  ->  A)  in  the 
nonredundant  covering,  but  K1  ->  Zi  (and  K2  ->  A)  not  in  the 
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Figure  4 


Examples  of  Violations  of  Property  P 

FDs  Schema  Synthesized 

Algorithm  1, 


f1:  XI, X2  ->  A 
f2:  C  ->  XI, X2 
f3:  A, XI  ->  B 
fU:  B,X2  ->  C 


HI (X1^X2,C,A) 

F2(AxXl.,B) 


(from  f1  f>  f2) 


A  does  not  satisfy  property  P,  yet  R1  is  in  3NF. 


(i) 


g1 :  XI  ,X2  ->  A,D 
g2:  C,D  ->  XI  X2 
q3:  A, XI  ->  B 
gU:  B,x2  ->  C 
gS:  C  ->  A 


SI {Xl^X2,CiD,A)  (from  g1  &  g2) 

S2  (A^X2,B) 

S3(B^X2,C) 

S4(C,A) 


SI  is  not  in  2NF,  since  A  is  partially  dependent 

upon  the  key  CO. 


(ii) 
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covering.  Thus,  a  relation  must  contain  both  K1  and  K 2  to 
manifest  the  transitive  dependency.  The  FD  K1  ->  Z  in  the 
transitive  dependency  is  a  compostion  of  K1  ->  K2  ->  Z.  If  A. 
does  not  have  property  P,  then  K1  ->  A  may  be  necessary  to  obtain 
K1  ->  K2,  in  which  case  K1  ->  A  need  not  be  redundant  (as  it 
would  be  in  Algorithm  la).  However,  if  A  does  have  property  P, 
then  K1  ->  K2  does  not  need  K1  ->  A,  so  K1  ->  A  is  redundant,  and 
we  have  the  theorem. 

Consider  the  set  of  FPs  in  figure  4(i)  which  produces 
the  relation  R1(XJ_j.X2,  C,  A)  via  Algorithm  1.  (The  reader  can 
check  that  XI, X2  ->  C  is  in  the  closure  of  the  given  FDs.)  The 
attribute  A.  is  nonprime  in  R1  and  does  not  satisfy  property  P, 
since  the  only  way  to  derive  XI, X2  ->  C  is  using  XI, X2  ->  A. 
However,  despite  the  violation  of  property  ?,  relation  R1  is  in 
3NF.  Hence,  property  P  is  not  a  necessary  condition  for  3NF. 

Figure  4  (ii)  presents  an  example  of  FDs  that  exhibit  the 
same  violation  of  property  P  as  figure  4(i)  but  induce  a  partial 
(and,  hence,  a  transitive)  dependency.  In  terms  of  the  above 
discussion  regarding  transitive  dependencies,  we  have  XI, X2  ->  C, 
C  />  XI, X2,  and  C  ->  A;  but  XI, X2  ->  A  is  not  redundant,  since 
XI, X2  ->  C  needs  XI, X2  ->  A  in  its  derivation. 

Property  P  affects  one  other  published  procedure  that 
synthesizes  relations  from  FDs.  Delobel  and  Casey  [10]  claim  that 
their  decomposition  procedure,  which  is  in  some  sense  comparable 
to  our  Algorithm  1,  produces  3NF  relations.  Their  claim,  however, 
is  incorrect  in  that  the  example  in  figure  4  (ii)  falsifies  their 
theorem. 

1.6,4  Putting  delations  into  3NF 

A  violation  of  property  P  may  induce  a  3NF  violation. 
Once  a  particular  violation  of  3NF  is  found,  then  to  put  the 
relation  into  3NF  the  offending  dependency  must  be  removed. 
Conveniently  enough,  if  a  nonprime  attribute  is  transitively 
dependent  upon  a  key  of  a  relation,  then  the  attribute  can  simply 
be  removed  from  the  relation,  and  the  resulting  schema  will  still 
embody  the  same  FDs. 

Theorem  3:  Let  Rk (A1 , . , . , An)  be  a  relation  in  a  schema 
5=  {P 1 , . . . , Rm)  that  was  synthesized  using  Algorithm  1.  Let  H  be 
the  set  of  FDs  embodied  in  S.  Let  Ai  be  an  attribute  of  Rk  that 
appears  in  none  of  Fk*s  synthesized  keys,  and  let  Ai  be 
transitively  dependent  upon  a  key  of  Rk,  Suppose  Ai  is  removed 
from  Rk,  resulting  in  a  new  relation  Rk’  and,  hence,  a  new  schema 
S*  =  f R 1  ,,,., R k Fm) .  Then  the  closure  of  the  set  of  FDs 
embodied  in  S*  equals  H*. 

Suppose  Ai  is  removed  from  Ek.  Since  Ai  does  not  appear  in 
any  of  the  keys  synthesized  by  A.lg.  1,  its  removal  can  only 
affect  embodied  FDs  of  the  form  f:X->Ai,  where  X  is  a  synthesized 
key  of  Rk.  Let  H*  be  the  set  of  FDs  embodied  in  S'.  If  we  show 
that  all  such  f  are  in  (H*)'-,  then  H'-  =  (H')+. 
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By  the  same  argument  used  in  the  proofs  of  theorems  1 
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Theorem  3  provides  us  with  a  simple  means  of  removing  an 
unwanted  transitive  dependency.  Namely,  excise  the  offending 
attribute  from  the  relation.  The  theorem  guarantees  that  the 
resulting  schema  still  represents  the  given  set  of  FDs. 
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Figure  5 


Algorithm  2 

Sinthesi2in£  a  Relational  Schema  from  a  Set  of  FDs 


1  .  (Eliminate  extraneous  attributes)  Let  F  be  the  given  set  of 
FDs.  Eliminate  extraneous  attributes  from  the  left  side  of 

each  FD  in  F,  producing  the  set  0.  An  attribute  is 

extraneous  if  its  elimination  does  not  alter  the  closure  of 
the  set  of  FDs. 

2.  (Find  covering)  Find  a  nonredundant  covering,  H,  of  G. 

3.  (Partition)  Partition  H  into  groups  such  that  all  of  the  FDs 
in  each  group  have  identical  left  sides. 

4.  (Kerge  equivalent  keys)  Let  J  =  0,  For  each  pair  of  groups, 

say  Hi  and  Hj,  with  left  sides  X  and  Y  respectively,  merge  Hi 
and  H2  together  if  there  is  a  bijection  X  <->  Y  in  H^-.  For 
each  such  bijection,  add  X  ->  Y  and  Y  ->  X  to  J.  For  each  A 

€  Y  if  X  ->  A  is  in  H,  then  delete  it  from  H.  Do  the  same 

for  each  Y  ->  B  in  H  with  B  8  X. 

5.  (Eliminate  transitive  dependencies)  Find  an  H*  c  H  such  that 
(H’+J)-*-  =  (H+J)  +  and  no  proper  subset  of  H*  has  this 
property.  Add  each  FD  of  J  into  its  corresponding  group  of 

H  •  . 

6.  (Construct  relations)  For  each  group,  construct  a  relation 
consisting  of  all  the  attributes  appearing  in  that  group. 
Each  set  of  attributes  that  appears  on  the  left  side  of  any 
FD  in  the  group  is  a  key  of  the  relation.  (Step  1  guarantees 
that  no  such  set  contains  any  extra  attributes.)  All  keys 
found  by  this  algorithm  will  be  called  synthesized.  The  set 
of  constructed  relations  constitutes  a  schema  for  the  given 
set  of  FDs. 
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Steps  1-4  of  Algorithm  2  are  effectively  implemented  as 
in  Algorithm  1,  Step  5  can  be  effectively  implemented  using  the 
membership  algorithm  presented  in  part  II.  Algorithm  2  can  then 
be  implemented  in  the  same  0(L2)  time  bound  as  Algorithm  1.  For 
details  see  section  II. 4. 


1.7  PROOF  CF  MINIHALITY 

The  purpose  of  this  section  is  to  examine  the  number  of 
relations  synthesized  by  Algorithm  2  (or  1)  for  a  given  set  of 
FDs,  compared  with  any  other  relational  schema  that  represents 
those  FDs.  y  We  will  show  that  ail  nonredundant  coverings 
generate  the  same  number  of  relations,  by  showing  the  number  of 
equivalence  classes  of  synthesized  keys  to  be  the  same  across  all 
nonredundant  coverings  of  a  given  set  of  FDs.  This  will  then 
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Lemma  2  cannot  be  strengthened  so  that  X=Y. 
one  can  have  two  nonredundant  coverings  with  equivalent  c 
such  that  the  two  coverings  have  different  lef 
representing  a  key  eguivalence  class.  For  example,  in  f 
g3  and  h3  have  functionally  equivalent  left  sides,  sine 


DE;  yet  CFj'DE. 


That  is, 
losu  res , 
t  sides 
igure  6 
e  CF  <-> 


Using  lemma  3  and  recognizing  that  Algorithm  2 
synthesizes  a  relation  from  each  maximal  group  of  FDs  that  have 
functionally  equivalent  left  sides,  we  can  now  see  that  all 
nonredundant  coverings  of  a  given  set  of  FDs  produce  the  same 
number  of  relations  by  Algorithm  2. 
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Figure  6 

!I)i2  Equivalent  Coverings  with  different  Keys 

G  =  {q1:C  ->  D, 
g2: D  ->  C, 
g3:CE  ->  F} 

H  =  {h1:C  ->  D, 
h2:D  ->  C, 
h3;DE  ->  F} 

G  and  H  ar®  nonredundant  and  G  +  =  H-*-,  Yet  g3  and  h3  generate 
different  relations.  This  is  an  example  of  lemma  3  where  X/^Y . 
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'theorem  4:  Let  F  be  a  set  of  FDs.  Any  two  nonredundant  coverings 
oT  F^will  produce  the  saae  nuiber  of  relations  via  Algorithm  2. 

Proof  Let  G1  and  G2  be  two  nonredundant  coverings  of  F.  By 
lemma  3,  if  an  FD  g:X  ->  A  is  in  G1,  then  there  is  an  h:Y  ->  B  in 
G2  with  X  ->  Y  and  Y  ->  X.  Thus,  for  any  group  of  FDs  in  G1  with 
functionally  equivalent  left  sides,  there  must  be  exactly  one 
such  group  in  G2,  namely,  the  one  that  has  the  same  functionally 
equivalent  left  sides.  Since  each  such  group  generates  one 
relation,  G1  and  G2  must  produce  the  same  number  of  relations, n 

Theorem  4  states  that  all  choices  of  nonredundant 
coverings  are  equally  good  in  terms  of  number  of  relations 
synthesized.  This  is  somewhat  surprising  in  that  it  contradicts 
the  intuition  that  perhaps  a  minimal-sized  nonredundant  covering 
would  produce  fewer  relations  than  other  larger  nonredundant 
coverings . 


The  theorem  also  shows  that  on  the  logical  level  there 
is  not  very  much  choice  as  to  how  to  pick  relations  that  cover 
the  given  set  of  FDs.  Some  of  the  decomposition  approaches  (e.g,, 
f 10,15,17])  claim  to  allow  the  system  to  choose  among  a  class  of 
possible  schemas,  directing  the  choice  by  efficiency 
considerations.  Since  all  coverings  have  the  same  set  of 
equivalence  classes  of  keys,  the  class  of  possible  schemas  is 
really  quite  small.  Hence,  if  one  is  guided  on  the  logical  level 
by  normalization  considerations  rather  than  by  efficiency 
considerations,  one  arrives  at  a  set  of  nearly  identical  possible 
schem  as. 


From  theorem  4,  we  can  see  that  the  number  of  relations 
generated  by  Algorithm  2  is  minimal  among  all  those  that  embody 
the  same  given  set  of  FDs.  This  gives  a  complete  characterization 
cf  the  optimal  3NF  schemas  discussed  in  [7], 

Corol lary ;  Let  S  be  a  schema  synthesized  from  a  set  of  FDs  F 
using  Algorithm  2.  Let  S*  be  any  schema  representing  a  set  of  FDs 
G  that  covers  F,  Then  |S»|  >  |S|, 

h  c  G  be  a  nonredundant  covering  of  F,  Certainly  H 
will  generate,  via  Algorithm  2,  no  more  relations  than  are  in  S*. 
Furthermore,  by  theorem  4,  Algorithm  2  will  generate  the  same 
number  of  relations  from  G  as  from  F.  Hence,  1S*|  >  1S|,  n 


1.8  CONCLOSICN 

The  purpose  of  this  part  of  the  paper  was  to  develop  an 
algorithm  for  synthesizing  a  3liF  schema  from  a  given  set  of  FDs 
and  to  examine  some  properties  of  such  schemas.  The  main  results 
were : 

1.  Certain  simple  algorithms  for  synthesizing  schemas 
either  produce  too  many  relations  or  violate  3NF. 
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2.  An  algorithm  that  synthesizes  £rovably  3NF  schemas 
presented.  The  essential  aspect  of  this  algorithm 
that  it  eliminates  as  much  redundancy  as  possible 
the  given  set  of  FDs. 

3.  All  nonredundant  coverings  produce  the  same  numbe 
relations  using  this  latter  method.  Hence,  synthes 
schemas  contain  a  minimal  number  of  relations. 


This  is  the  first  successful  attempt,  to  our  hnowle 
of  implementing  Codd's  normalization  pcocedure  [7]  both  prov 
and  effectively.  (Errors  in  two  earlier  similar  attempts 
isolated.)  Furthermore,  by  the  corollary  to  theorem  u, 
sythesized  relations  satisfy  Codd's  optimality  criterion  - 
other  schema  covering  the  same  FDs  has  fewer  relations. 
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P^EI  II  III  problem 


I I . 1  INTRODUCTION 


Re  have  presented  in  part  I  an 
synthesizing  3NF  relational  schemas  from  FDs,  Th 
the  algorithm  depends  on  whether  it  can 
efficiently.  It  turns  out  that  the  problems  of 
attribute  in  a  left  side  of  an  FD  is  extraneous 
given  set  of  FDs  is  redundant  are  instances 
problem  --  to  decide  if  a  given  FD  is  in  the  cl 
set  of  FDs.  We  call  this  latter  problem  the  membership 
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IT. 2  DERIVATION  TREES 

I I. 2.1  Derivation  Trees  as  a  Model  for  Derivations 

Armstrong  [1]  proved  that,  given  the  FDs  X  ->  E1,...,X 
->  Bk,  the  FD  X  ->  can  be  derived  using  the  axioms  A1, 

A2,  A3.  It  is  trivial  that  each  of  the  FDs  X  ->  B1,...,X  ->  Bk 

is  derivable  from  X  ->B1,...,Bk.  Thus,  the  FD  X  ->  R1,...,Bk  is 

equivalent  to  the  set  {X  ->E1,.,,,X  ->  Bk) .  In  this  section  we 
will  regard  an  FD  of  the  form  X  ->  Bl,...,Bk  merely  as  a 
representation  of  the  FDs  X  ->  B1,...,X  ->  Bk. 

Let  X  be  a  set  of  attributes,  let  G  be  a  set  of  FDs  over 
X,  and  let  q:B1,..,,Bk  ->  C  be  an  FD  over  X.  If  g  €  G-*-,  then 
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there  is  a  sequence  of  applications  of  axioms  A1,  A2,  and  A3  on  G 
that  yields  g.  In  this  section  we  will  develop  a  graph  model, 
called  a  derivation  tree,  for  such  a  sequence  of  applications  of 
the  axioms. 

Let  G  be  a  set  of  FDs.  G- based  derivation  trees  (abbr. 
G-based  DT)  are  formally  defined  as  follows: 

1.  If  C  is  an  attribute,  then  the  labelled  node  C  is  a  G- 
based  DT. 

2.  If  T  is  a  G-based  DT  with  C  as  a  leaf  node,  and 
f:B1,.,.,Bra  ->  C  is  an  FD  in  G,  then  the  tree  constructed 
from  T  by  adding  B1,...,Bm  as  children  of  the  leaf  node  C  is 
also  a  G-based  DT, 

The  derivation  tree  is  a  simple  model  for  the  successive 
composition  of  FDs  by  pseudotransitivity  {this  is  formalized 
below),  A  sample  derivation  tree  construction  is  given  in  figure 

7. 

A  DT  is  characterized  by  its  root,  by  its  leaf  set  and 
by  the  FDs  that  appear  in  it.  We  will  abbreviate  the  expression 
’’a  DT  whose  leaf  set  is  contained  by  (A1,..,,An}”  by  ”an 
fA1 , . . . , An}-DT" .  If  T  is  an  X-DT  rooted  at  A  then  we  call  it  a 
"derivation  tree  for  the  FD  X  ->  A".  (This  terminology  will  be 
justified  by  theorem  5  below.) 

The  following  lemma  is  an  important  step  towards  a 
formal  characterization  of  the  connection  between  derivation 
trees  and  derivations  of  FDs.  It  is  a  formal  restatement,  using 
DTs,  of  lemma  1  which  was  used  extensively  in  proving  the  main 
theorems  of  part  I. 

L£]!li!l§.  if*  T  be  a  G-based  derivation  tree.  Let  X  be  a 
nonempty  subset  of  the  nodes  of  T  and  let  Y  be  the  set  of  all 
attributes  that  appear  as  leaves  of  T.  Then  Y  ->  X  is  in  G+. 

£122^  Consider  first  the  case  that  X  is  simply  the  root  node. 
This  sublemma  can  be  proved  by  induction  on  the  number  of  FDs 
that  are  added  to  the  DT  (i.e.,  applications  of  (2)  above).  This 
follows  directly  since  each  such  addition  preserves  the  desired 
property  that  the  root  is  functionally  dependent  upon  the  set  of 
leaves  by  virtue  of  the  psued otra nsit ivit y  rule. 

Now  suppose  Xi  €  X  is  any  internal  node  of  T.  Since  Xi 
roots  a  Y-DT,  by  the  above  sublemma  we  have  Y  ->  Xi,  By  the 
observation  at  the  beqining  of  the  section,  if  Y  ->  Xi  for  all  Xi 
€  X,  then  Y  ->  X,  completing  the  proof,  n 


To  make  the  DT  model  complete  with  respect  to 
Armstrong's  axioms  we  have  to  consider  axioms  A1  ( ref lexiv i t y) 
and  A2  (augmentation)  as  well,  Fxcept  for  FDs  of  the  form  X->X, 


) 
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Figure  7 

b  Sample  Derivation 

Given:  G  =  {g1:AD->C;  q2:C->D;  g2:DE->F;  f4:A->E} 

Show:  f:AB  ->  F  €  G<- 

FD  fJsed  Derivation  Current  FD 
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any  FD  that  can  be  derived  with  reflexivity  and 
psuedotransitivi t y  can  also  be  derived  without  reflexivity  simply 
by  eliminating  all  the  applications  of  axiom  A1.  Therefore,  we 
may  assume  without  loss  of  generality  that  reflexivity  is  not 
used  in  derivations  except  for  deriving  FDs  of  the  form  f:X->X. 
Similiarly,  in  a  DT,  reflexivity  corresponds  to  taking  a  leaf 
node,  making  a  copy  of  it,  and  connecting  the  copy  as  a  child  of 
the  original  leaf.  Clearly,  this  rule  can  add  no  new  nodes  to  the 
leaf  set  of  a  DT,  and  hence  is  basically  a  null  operation  and 
need  not  be  included  in  the  definition  of  a  DT.  The  FDs  of  the 
form  X  ->  X  are  handled  by  part  (1)  of  the  DT  definition. 

Augmentation  corresponds  to  the  addition  of  extra  leaf 
nodes  connected  ■♦■o  an  internal  node  of  the  dT.  All  of  the 
children  of  any  node  that  was  added  by  augmentation  could 
themselves  have  been  added  by  augmentation.  Consider  a  DT  in 
which  augmentation  was  used  to  produce  what  is  now  a  non- leaf 
node,  E,  of  the  tree.  One  can  eliminate  E  from  the  tree  by 
replacing  it  by  all  of  its  descendants  that  are  leaves.  Doing 
this  to  all  internal  nodes  that  were  produced  by  augmentation 
yields  a  DT  in  which  all  applications  of  augmentation  produce 
leaves.  Similiarly,  one  application  of  augmentation  at  the  very 
last  step  of  a  derivation  is  all  that  is  needed  to  derive  any 
derivable  FD,  Therefore,  we  do  not  need  to  use  augmentation  in 
DTs;  after  a  DT  is  constructed  we  can  simply  add  any  attribute  to 
the  left  side  of  the  FD  it  represents.  This  leads  us  to  the 
following  theorem  for  the  completeness  of  DTs. 

Theorem  5:  For  a  given  FD  g:X  ->  C  and  a  set  of  FDs  G,  g  e  G-*-  if 
and  only  if  there  is  a  G-based  X-DT,  T,  rooted  at  C, 

^1221  Let  T  be  a  G-based  X-DT  rooted  at  C.  T  represents  an  FD 
X'  ->  C  in  G+  where  X’  c  X.  Hence,  by  lemma  4  and  augmentation, 
g  €  G+.  To  prove  the  converse,  we  know  that  if  g  6  G'*',  then  there 
is  a  sequence  of  (say)  N  applications  of  Armstrong's  axioms 
yielding  g  from  G.  From  the  above  discussion,  we  can  assume  there 
are  no  applications  of  reflexivity  in  the  sequence,  and  that 
applications  of  augmentation  are  all  postponed  to  the  last  step. 
Thus,  the  first  N-1  steps  are  all  applications  of 
pseudotransitivity  and  can  be  simulated  by  a  G-based  X-DT  rooted 
at  C ,  n 


Py  the  theorem,  DTs  serve  as  a  model  for  derivations  of 
FDs  in  which  the  right  side  is  a  single  attribute.  However,  to 
show  that  X->yi,..Yk  is  derivable,  it  is  sufficient  to  construct 
derivation  trees  for  X- >Y 1 , . . . , X- >Y k.  Thus,  the  concept  of  a  DT 
is  general  enough  for  our  needs. 

11,2.2  Additional  Properties  of  Derivation  Trees 

Hsinq  theorem  5  and  lemma  4,  we  can  now  prove  lemmas  1 
and  2  which  were  stated  in  Part  I  without  formal  proofs. 
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1  section  T.4.1):  Let  G  be  a  set  of  FDs,  and  let  g:X 
->  Y  be  an  FB  in  G,  If  h:V  ->  W  is  in  G<-  and  g  is  used  for  some 
derivation  of  h  from  G*,  then  V  ->  X  is  in  G-»-. 

££221  Without  loss  of  generality  we  may  assume  that  W  is  a 
single  attribute.  If  h  e  G-*-  and  g  is  used  in  some  derivation  of 
h,  then  there  is  a  G-based  V-DT,  T,  rooted  at  W  in  which  g 
appears.  Every  attribute  of  X  is  a  node  of  T.  Hence,  we  can 
apply  lemma  4  to  obtain  V  ->  X.  n 

The  proof  of  lemma  2  uses  lemma  1  and  the  fact  that  each 
6  attribute  appears  in  only  one  FB. 

Lemma  2  (cf.  section  1.5.3):  If  X  ->  9  is  in  a  set  of  FDs  G, 
then  for  any  ncnredundant  covering  H  of  G,  either  X  ->  9  is  in  F 
or  Y  ->  0  is  in  H,  where  Y  ->  X  and  X  ->  Y. 

Proof  If  X  ->  9  is  in  H,  then  we  are  done,  so  assume  not.  Since 
H  covers  G,  there  must  be  a  derivation  for  X  ->  9  from  H.  Le^-  Y 
->  0  be  the  root  FD  of  an  H-based  derivation  tree  for  X  ->  0.  By 
l=^mma  1,  X  ->  Y.  To  show  Y  ->  X,  we  examine  G-*-.  In  G,  X  ->  ©  is 
the  only  FD  containing  9.  Thus,  any  derivation  for  any  FD  in  G* 
with  9  on  the  right  side  must  use  X  ->  9  as  the  root  FD.  In 
particular,  X  ->  0  is  the  root  FD  of  any  derivation  for  Y  ->  e 
from  G.  Hence,  by  lemma  1,  Y  ->  X.  a 


A  priori,  derivation  trees  can  be  arbitrarily  large. 
The  follwing  lemma  states  that,  for  all  practical  purposes,  we 
can  restrict  our  attention  to  '’small”  trees.  The  rationale 
behind  the  lemma  underlies  our  work  in  the  next  section.  The 
lemma  is  essentially  the  same  as  a  well  known  result  about 
derivation  trees  in  the  theory  of  context  free  languages. 

Lemma  5:  If  g  €  G-*-  then  there  exists  a  G-based  derivation  tree 
for  g,  T,  such  that  in  T  no  path  from  the  root  to  a  leaf  contains 
more  than  one  occurrence  of  any  attribute. 

Proof  Suppose  g  €  G-*-  and  let  T1  be  a  G-based  derivation  tree  for 
g.  If  T1  satisfies  the  condition  of  the  lemma,  then  T  =  T1  and 
we  are  done.  Otherwise,  let  us  look  at  some  path  {A1,...,Ak), 
where  A1  is  the  root  and  Ak  is  a  leaf,  such  that  the  path 
contains  at  least  two  occurrences  of  some  attribute  A.. 

Let  Ai  and  Aj  be  two  occurrences  of  A  where  i  <  j.  If 
we  excise  from  T1  the  subtree  rooted  at  Ai  (=A)  and  replace  it  by 
the  subtree  rooted  at  Aj  (=A.)  ,  then  we  obtain  a  tree  T2  such 
that : 

1.  All  FDs  used  in  T2  belong  to  G. 

2.  The  set  of  leaves  of  T2  is  a  subset  of  the  set  of  leaves 
of  T 1  . 

3.  T2  and  T1  have  the  same  root. 


U,  T2  has  fewer  nodes  than  T1. 

Therefore,  T2  is  a  G-based  derivation  tree  for  g  and  is  smaller 
than  T 1 . 


This  process  of  replacing  subtrees  by  smaller  subtrees 
can  be  continued  as  long  as  the  trees  produced  do  not  satisfy  the 
condition  of  the  lemma.  Since  these  trees  contain  fewer  and 
fewer  nodes,  the  process  must  terminate.  The  last  tree  produced 
by  the  process  satisfies  the  condition  and  is  the  reguir'=‘d  tree 
T.  n 


II.  3  A  LINEAF  TIME  f!i;PlBEH SHIP  ALGORITHM 
II. 3.  1  Overview  of  the  Membership  Problem 

The  membership  Problem  for  a  set  G  of  FDs  is:  Given  an 
FD  g,  decide  if  g  e  G+.  In  this  section  we  present  an  algorithm 
that  solves  the  problem  in  time  proportional  to  the  size  of  G. 
In  the  following  we  assume,  without  loss  of  generality,  that  the 
right  side  of  g  is  a  single  attribute. 

Since  g  S  G+  if  and  only  if  there  exists  a  G-based 
derivation  tree  for  g,  one  obvious  way  to  solve  the  problem  is  to 
try  to  find  such  a  derivation  tree  or,  at  leas-*-,  prove  that  one 
exists.  From  lemma  5  it  follows  that  a  search  for  such  a 
derivation  tree  must  terminate.  For  any  set  G  of  FDs,  the  number 
of  attributes  in  G  is  fini+e  and,  therefore,  the  number  of  G- 
based  derivation  trees  satisfying  the  condition  of  lemma  5  is 
finite.  Given  g,  one  can  decide  if  g  e  G-*-  by  enumerating  these 
trees  and  checking  each  one  to  see  if  it  is  a  derivation  tree  for 
g.  However,  the  number  of  these  trees  may  be  guite  large  and 
this  enumeration  algorithm  may  be  too  time  consuming. 

A  more  feasible  approach  to  the  problem  is  to  try  to 
construct  a  derivation  tree  for  g  step  by  step.  While  we  may 
occasionally  make  a  mistake  by  trying  an  FD  that  is  not  used  in  a 
DT  for  g,  if  the  number  of  such  mistakes  is  not  too  large  then 
the  resulting  algorithm  can  be  guite  efficient.  In  r2]  a  bottom- 
up  algorithm  for  constructing  a  DT  for  g  is  presented  tha+-  works 
in  time  roughly  proportional  to  the  square  of  the  size  of  G.  Ir 
the  next  subsection,  we  present  an  improved  version  of  this 
algorithm  that  works  in  time  linearly  proportional  to  the  size  of 
G. 

I I . 3 . 2  The  Algorithm 

Let  G  =  fg1,...,gn)  be  a  set  of  FDs  involving 
attributes  from  the  set  [A1,...,Am}.  We  assume  that  G  is  given 
as  a  string  of  pairs  where  each  pair  represents  an  FD  and 
consists  of  a  left  side  and  a  right  side.  Each  side  is  a 
sequence  of  attributes.  Attributes  are  represented  as  numbers  in 
the  se^-.  {1,...,m}.  The  length  of  the  string  representing  G  will 
be  denoted  by  | G  1  . 
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Let  ->  C  be  given  where  (B 1 , . , . , Bk, C)  c 

(;a1  , .  .  . , Am] .  To  check  if  g  6  we  try  to  compute  the  set  of 
attributes  that  are  functionally  dependent  upon  B1,...,Bk  in  G. 
Then,  g  €  G-*-  if  and  only  if  C  is  in  this  set.  This  can  be  done 
as  follows. 


We  use  a  set  variable,  DEPEND,  to  hold  attributes  that 
are  functionally  dependent  upon  B1,...,Bk.  Initially,  we  set 
DEPEND  to  {B1,...,Ek}  as,  clearly,  each  Ri  is  functionally 
dependent  on  this  set  (by  reflexivity  and  augmentation).  Given 
the  set  DEPEND,  we  look  for  an  FD  in  G  such  that  its  left  side  is 
contained  in  DEPEND  but  its  right  side  is  net.  Since  every 
attribute  on  its  left  side  is  functionally  dependent  upon 
B1,..,,Bk,  so  is  also,  by  pseudotransitivity,  every  attribute  on 
its  right  side.  Therefore,  the  attributes  on  its  right  side  are 
added  to  DEPEND.  (Conceptually,  we  start  with  DEPEND  containing 
the  leaf  set  of  a  DT  for  g.  Each  time  we  find  a  node  whose 
children  are  all  in  DEPEND  we  add  the  node  to  DEPEND.)  This 
operation  is  iterated  until  no  new  attributes  to  be  added  arc 
found.  Then  DEPEND  contains  all  attributes  that  label  nodes  of 
f B 1 , . . . , Bk} -DTs  and  g  e  G+  if  and  only  if  C  €  DEPEND.  The  method 
is  formally  implemented  as  Algorithm  3,  see  figure  8.  (For 
brevity,  we  use  in  the  algorithm  the  abbreviations  LS,  FS  for 
left  side  and  right  side,  respectively.) 

To  analyze  the  time  complexity  of  the  algorithm  we  note 
that  in  each  iteration  of  the  OUTER  loop  (except  when  the  final 
iteration  results  in  FOOND  =  FALSE)  at  least  one  attribute  is 
added  to  DEPEND,  In  the  worst  case,  the  number  of  iterations  of 
OUTER  may  be  close  to  m.  In  any  such  iteration,  the  INNER  loop 
scans  the  input  string  G.  Therefore,  in  the  worst  case,  the 
total  time  spent  by  the  algorithm  may  be  proportional  to  mlG|. 
(Actually,  it  seems  that  even  more  time  is  reguired.  However,  in 
a  clever  implementation  this  time  bound  can  be  achieved.  We  will 
not  go  into  the  details  now,  as  we  will  present  a  more  efficient 
algorithm.)  He  will  now  try  to  improve  the  algorithm  so  as  to 
reduce  this  time  bound. 

The  algorithm  is  obviously  inefficient.  First,  we  note 
that,  when  an  FD  gi  satisfies  the  condition  in  INNER,  its  right 
side  is  added  to  DEPEND,  The  values  assumed  by  DEPEND  form  a 
monot onically  increasing  sequence  of  sets;  so  gi  will  not  satisfy 
the  condition  a  second  time  and  it  need  not  be  checked  in  future 
iterations.  Another  problem  is  the  fact  that  each  attribute  on 
the  left  side  of  each  FD  is  checked  in  each  iteration  of  OUTFR. 
Now,  once  an  attribute  on  the  left  side  of  an  FD  is  known  to  be 
in  DEPEND,  it  is  redundant  to  check  repeatedly  this  attribute's 
membership  in  DEPEND. 

These  two  problems  can  be  solved  by  appropriate  changes 
to  the  algorithm.  The  first  problem  was  actually  solved  in  the 
algorithm  presented  in  f2];  the  second  can  be  solved  similiarly. 
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Figure  8 
ALGORITHM  3 

Deciding  the  Menil’££ski.£  ££2]2i£® 

INPUT:  A  set  G  of  FDs  and  an  FD  g:31,...,Bk  ->  C. 

OUTPUT:  'yes*  if  g  S  G+,  ‘no’  if  G  -e  G+ . 

/*  Data  structures  ’*'/ 

DEPEND:  The  set  of  all  attributes  found  to  be  dependent 

on  {E1,...,Pk}  so  far. 

(FLAG,  FOUND):  Boolean  variables; 

/♦  Initialize  ♦/ 

DEPEND  =  {E1 ,  .  .  .  ,  Bk}  ; 

FLAG  =  TRUE; 

/♦  Build  up  the  set  DEPEND  */ 

OUTER:  do  while  (FLAG); 

FOUND  =  FALSE; 

INNER:  do  for  each  gi  €  G; 

if  (LS  (gi)  c  DEPEND  5 
ES(gi)  ^c  DEPEND) 
then  do; 

add  RS (qi)  to  DEPEND; 

FOUND  =  TRUE; 
end  ; 

end  INNER; 

FLAG  =  FOUND; 
end  OUTER; 

/♦  Print  results  */ 

P  PINT :  if  (C  e  DEPEND) 

then  print  'yes ' ; 
else  print  ' no' ; 
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However,  even  with  these  two  changes,  the  worst  case  time  bound 
would  still  be  0(ni|G|).  The  reason  is  that  these  problems  are 
special  cases  of  a  more  general  problem,  namely,  that  in  each 
iteration  of  OUTER  all  of  G  is  scanned  although  only  a  small  part 
of  G  is  actually  involved  in  any  operation.  Thus,  if  G  contains 
FDs  whose  left  side  contain  attributes  which  are  not  derivable 
from  then  these  FDs  will  be  scanned  in  each 
iteration.  ^^Iso,  even  in  the  algorithm  presented  in  f  2  ],  the 
left  side  of  an  FD  that  eventually  does  appear  in  a  DT  for  g  is 
unnecessarily  scanned  many  times  before  its  right  side  is  added 


to  DEPEND. 


Our  problem  then  is  to  find  a  way  to  ch 
algorithm  such  that  the  following  holds:  In  each  itera 
OUTER  an  FD  is  visited  only  if  there  is  some  operati 
performed  on  it  in  that  iteration,  and  only  the  attribut 
are  actually  involved  in  the  operation  are  visited. 

The  basic  operation  on  an  attribute  on  a  left  si 
FD  is  to  "mark”  it  as  belonging  to  DEPEND.  The  basic  o 
on  an  FD  is  to  add  its  right  side  to  DEPEND  if  its  left 
contained  in  DEPEND.  I^s  new  attributes  are  added  to  DE 
each  iteration,  it  seems  reasonable  to  visit  an  FD  onl 
left  side  contains  an  occurrence  of  an  attribute  that  wa 
to  DEPEND  in  the  previous  iteration.  Each  such  occur 
"marked'*  and  never  visited  again  (and,  if  all  attributes 
corresponding  left  side  are  marked,  then  the  right  side 
to  DEPEND  and  is  also  never  visited  again)  .  The  questio 
can  we  arrange  to  visit  only  these  occurrences  of  attri 
the  left  sides  of  FDs;  that  is,  how  can  we  locate  them 
scanning  all  of  G? 
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s  added 
rence  is 
on  the 
is  added 
n  is  how 
butes  on 
without 


To  solve  this  problem,  we  propose  to  use  linked  lists 
that  are  threaded  through  the  input  string.  For  each  attribute 
in  {A  1 , . . . , Am} ,  we  have  a  linked  list  of  all  occurrences  of  that 
attribute  on  left  sides  of  FDs.  Thus,  each  attribute  on  the  left 
side  of  an  FD  appears  in  excatly  one  such  list.  After  an 
attribute  is  added  to  DEPEND  we  can  follow  the  links  on  its 
corresponding  list,  "mark"  each  occurrence  of  the  attribute  on 
the  linked  list  as  being  derivable  from  B1,,..,Bk,  and  check  if 
the  FD  in  which  it  appears  now  satisfies  the  condition  in  INNER. 

The  left  side  of  an  FD  is  in  DEPEND  if  and  only  if  all 
of  the  attributes  on  that  left  side  are  marked.  To  determine  this 
latter  condition,  we  maintain  a  counter  for  each  FD.  The  counter 
holds  the  number  of  attributes  that  are  on  the  left  side  of  the 
FD  and  do  not  yet  belong  to  DEPEND.  "Marking"  an  occurrence  of 
an  attribute  in  an  FD  then  reduces  to  decrementing  the  counter  of 
that  FD ;  checking  whether  the  left  side  of  the  FD  is  contained 
in  DEPEND  reduces  to  comparing  the  value  in  the  counter  to  zero. 
If  the  counter  can  be  accessed  directly  from  the  occurrence  of 
the  attriubte,  then  these  two  operations  can  be  done  in  a 
constant  number  of  steps. 

Notice  that  each  occurrence  of  an  attribute  in  the  input 
string  is  visited  at  most  once,  since  an  attribute  is  added  to 
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Figure  9 


ALGORITHM  4 

i2®£i^i£.3  ill®  ££2^1^51 


INPUT:  A  set  G  of  FDs  and  an  FD  ->  C. 

orjTPOT:  'yes*  if  g  e  G-^,  'no*  if  g  -.e  G+. 


/*  Data  Structures  */ 

FD(1:n):  FD  (i)  is  a  structure  describing  the  i-th  FD, 

consisting  of  an  integer  COUNTER  and  a  FIGHT_SIDE 
containing  an  attribute, 

ATTRLIST  ( 1 : tn)  :  a  singly-linked  list  of  those  FDs  with  Ai  on  their 

left  hand  sides. 

DEPEND:  the  set  of  all  attributes  found  to  be  dependent 

upon  {3l,.,.,Bk)  so  far. 

NEWDEFEND:  the  subset  of  DEPEND  that  has  not  yet  been  examined. 


/*  Preprocess  G  to  build  ATTRLIST  and  FD  */ 

dc  for  each  gi  €  G; 

do  for  each  Ai  €  LS <qi) ; 

add  gi  to  ATTRLIST ( j) ; 
increment  COUNTER  of  FD  (i)  by  1 ; 

end  ; 

set  BIGHT_SIDE  of  FD(i)  to  be  RS(gi); 


end; 
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Figure  9  -  page  2 


/♦ 


si  iil§  algorithm 


♦/ 


/*  Initialize  ♦/ 

DEPEND  =  {B1  .  ,Bk}  ; 

NEWDEPEND  =  DEPEND; 

/♦  Build  up  the  set  DEPEND,  ♦/ 

OUTER:  do  while  (NEHDE PEND  is  not  empty); 


remove  an  attribute  from  NEWDEPEND 

and  assign  it  to  the  variable  NEXT_TO_CHECK ; 


INNER:  do  for  each  gi  on  ATTRL 1ST (N EXT_TO_CHECK) ; 


decrement  COUNTER  of  FD(i) 
if  ((COUNTER  of  FD(i)  =  0) 
do  for  each  attribute  A 


by  1  ; 

then 

in  RIGHT__SIDF  of  FD  (i) 


if 


(A  -.e  DEPEND)  then 
add  A  to  DEPEND  and 


NEWDEPEND; 


end ; 


end  INNER; 
end  OUTER; 

/*  Print  results  */ 

PRINT  :  if  (C  e  DEPEND) 

then  print  *yes' 
else  print  ‘no* 


DEPEND  at  most  once.  Attributes  that  are  not  derivable  froir 
will  not  be  visited  at  all.  This  solves  all  the 
problems  mentioned  above.  Most  important,  though,  is  the  fact 
that  since  each  visit  to  an  occurrence  of  an  attribute  takes  a 
bounded  number  of  steps  and  each  occurrence  is  visited  at  most 
once,  the  algorithm  takes  time  linearly  proportional  to  IG).  The 
algorithm  is  formally  implemented  as  Algorithm  4,  see  figure  9. 

have  assumed  that  the  right  side  of  g  is  a  single 
attribute.  This  assumption  was  for  simplicity  of  presentation 
only  and  is  not  essential  to  the  algorithm.  If  the  right  side  of 
g  is  C1,...,Cp  then  the  only  change  needed  is  to  check  in  the 
printing  stage  if  C1,..,,Cp  are  elements  of  DEPEND. 


I I . 3 . 3  Ana  iisis  of  Ik®  Algorithm 


In  this  section  we  prove  that  .Algorithm  4  is  correct  and 
analyze  its  time  complexity. 

The  preprocessing  step  of  the  algorithm  consists  of  a 
single  scan  of  G.  For  each  occurrence  of  an  attribute,  a 
constant  number  of  steps  is  performed.  Therefore,  this  part 
terminates  and  takes  time  OdGj).  After  termination  of  this  part 
the  following  hold: 


1 . 


2. 


For  each  gi  6?  G 

a.  The  value  of  the  CCUNTSS  of  FD(i)  is  equal  to  the 
number  of  attributes  on  the  left  side  of  gi. 

b.  The  set  PIGHT_SIDE  of  FD  (i)  contains  the  attributes 
on  the  right  side  of  gi. 

For  each  attribute  Aj,  each  FD  that  contains  Aj  on  its 
left  side  is  on  ATTFLIST(j). 


We  now  turn  our  attention  to  the  second  part  --  the  main 
body  of  the  algorithm.  Me  present  an  informal  proof  that  this 
part  terminates  and  produces  correct  results. 

The  sets  DEPEND  and  NEHDEPEND  are  initialized  to 
(E1,...,Bk}.  The  only  other  place  that  an  attribute  can  be  added 
to  these  sets  is  in  the  if-statement  in  the  INNER  loop.  Since 
attributes  are  never  deleted  from  DEPEND,  it  follows  from  the 
condition  in  the  if-statement  that  an  attribute  can  be  added  to 
these  sets  at  most  once.  In  each  iteration  of  OUTER  one 
attribute  is  removed  from  NEWDEPFND  so  the  number  of  iterations 
is  exactly  the  number  of  attributes  added  to  the  sets  and  is  at 
most  m.  In  any  iteration  of  OUTER,  the  INNER  loop  has  the  form 
’’do  for  each  element  of  a  finite  sot”,  and  must  therefore 
terminate.  From  this  it  follows  that  the  algorithm  terminates 
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attributes  on  the  left  side  of  gi  that  either  are  not  in  DFPEND 
or  are  in  the  intersection  of  DEPEND  and  NEWDEPEND.  This  is 
obviously  true  at  the  time  of  the  first  test,  since  then  DEPEND  = 
NFHDEPEND  and  each  COUNTER  is  equal  to  the  number  of  attributes 
on  the  left  side  of  the  corresponding  FD  (see  la.  above) .  In  an 
iteration  of  OUTER  one  attribute  is  removed  from  NEWDEPEND  and 
the  COUNTER  of  any  FD  that  contains  it  on  its  left  side  is 
decremented;  so  the  claim  is  true  after  the  iteration. 

It  is  easy  to  show  that  each  attribute  added  to  DEPEND 
is  dependent  by  G  on  B1,...,Bk.  This  is  trivially  true  for 
B1,...,B)t.  Any  other  attribute  is  added  only  if  it  appears  on 
the  right  side  of  some  gi  and  the  COUNTER  of  FD(i)  is  0,  Py  the 
above  observation,  this  means  that  all  attributes  on  the  left 
side  of  gi  have  already  been  added  to  DEPEND  and  are  therefore 
(by  induction  hypothesis)  derivable  from  B1,..,,Bk,  Thus  the  new 
attribute  is  also  derivable  from  them. 

Finally,  we  show  that  all  attributes  that  depend  by  G  on 
B1,...,Bk  will  be  in  DEPEND  when  the  algorithm  terminates.  We 
will  use  induction  on  the  depth  of  derivation  trees.  For 
derivation  trees  of  depth  zero,  we  have  to  consider  only 
B1,...,Bk  and  they  are  all  in  DEPEND.  Given  an  attribute  that 
has  a  derivation  tree  of  depth  i+1,  we  look  at  the  root  FD  of  the 
tree,  g -j :  D1  .  . .  Dp->E .  The  attributes  D1,...,Dp  all  have 
derivation  trees  of  depth  <  i.  Therefore,  each  of  these 
attributes  is  added  to  DEPEND.  Now,  when  the  last  of  these  D's 
is  removed  from  NEWDEPEND,  the  COUNTER  of  FD(j)  will  be  set  to 
zero  so  E  will  be  added  to  DEPEND  unless  it  is  already  there. 
This  concludes  the  correctness  proof. 

Having  proved  the  correctness  of  the  algorithm,  we  can 
now  add  a  shortcut.  When  the  attribute  C  is  added  to  DEPEND,  we 
exit  from  OUTER  and  proceed  directly  to  the  output  step. 
Obviously,  this  can  only  lead  to  a  faster  algorithm. 

We  have  already  seen  that  the  preprocessing  stage  takes 
time  proportional  to  |G|.  In  the  main  body  of  the  algorithm  each 
attribute  in  NEWDEPEND  is  removed  exactly  once.  The  processing 
in  the  corresponding  iteration  of  OUTER  consists  of  a  constant 
number  of  steps  performed  for  each  occurrence  of  the  attribute  on 
a  left  side  of  an  FD.  Similiarly,  the  RIGHT_SIDE  of  FD(i)  is 
visited  at  most  once  and  then  a  constant  number  of  steps  is 
performed.  Therefore,  the  algorithm  works  in  time  0(IG|).  (We 
have  assumed  that  )g|  <  JG|.) 

While  the  worst  case  time  of  the  main  body  of  the 
algorithm  is  0(|G|),  this  is  not  always  the  best  estimate.  If  G 
contains  many  FDs  that  do  not  contain  attributes  derivable  from 
Bl,...,Bk,  then  these  FDs  will  not  be  visited  at  all.  Also,  the 
running  time  of  the  algorithm  depends  on  the  depth  of  derivation 
trees  of  g.  If  g  has  a  shallow  derivation  tree,  its  right  side 
will  be  added  to  DEPEND  at  an  early  stage  and  the  algorithm  will 
be  faster.  (It  can  be  shown  that  all  attributes  that  have  G- 
based  fB 1 ,..., Bk) -derivation  trees  of  depth  <  i  are  added  to 
DEPEND  before  any  attribute  that  has  only  derivation  trees  of 
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depth  >i  is  added.)  These  considerations  are  of  importance  in 
cases  where  many  membership  tests  based  on  one  group  of  FDs  are 
performed.  Preprocessing  can  be  done  once  and  then  only  the  body 
need  be  applied  for  each  FD  being  tested. 

Theorem  6:  Membership  in  the  closure  of  a  set  of  FDs  can  be 
tested  in  linear  time.  a 


P^MAPK  He  have  presented  here  a  bottom- up  algorithm.  We  have 
also  developed  a  linear  time  top-down  algorithm.  In  principle, 
it  works  by  a  lef t-to-right  depth-first  expansion  of  the  required 
DT,  starting  with  the  root.  However,  it  is  quite  complicated  and 
a  description  of  it  will  not  be  given. 

I I . 3 . 4  An  Algorithm  for  a  Hes trie ted  Class  of  T a u toioqies 

It  is  known  that  there  is  a  close  relationship  between 
the  theory  of  FDs  and  the  propositional  calculus.  This  fact  was 
originally  observed  by  Delobel  and  Casey  [10].  Faqin,  in  a  recent 
paper  [11],  has  given  a  clear  statement  of  this  relationship  and 
presented  two  alternative  proofs.  We  will  now  show  that  it 
follows  from  this  relationship  that  t autologihood  can  be  decided 
for  a  restricted  class  of  propositional  formulas  (to  be  defined 
later)  in  linear  time. 

In  the  following  we  use  =>  as  the  implication  symbol  of 
the  propositional  calculus.  Let  N  be  the  following  mapping  from 
FDs  and  sets  of  FDs  to  propositional  formulas: 

For  q;E1,...,Bk  ->  C1,.,.,Cp  N  (g)  =  B1S...&Bk  =>  C15..,5Cp 

For  G  =  (q1,...,qn}  N  (G)  =  N  (g 1) S. , . DN  (gn) 

The  relationship  as  stated  by  Fagin  is: 

Theorem  (Fagin  [11]):  For  any  FD  g  and  for  any  set  G  of  FDs,  g  e 
G+  if  and  only  if  N{g)  is  derivable  in  the  propositional  calculus 
from  N  (G)  . 

Proofs  can  be  found  in  [11].  We  will  sketch  briefly  the 
idea  behind  the  first  proof.  The  transformation  N  is  essentially 
a  translation  of  FDs  into  the  language  of  the  propositional 
calculus,  Fagin  showed  that  if  Armstrong’s  axioms  are  similiarly 
translated  the  result  is  a  complete  set  of  axioms  for  the 
propositional  calculus.  Therefore,  a  derivation  of  q  from  G,  when 
translated,  is  a  proof  of  N (g)  from  N (G)  that  uses  these 
translated  axioms.  Similiarly,  a  proof  of  N  (g)  from  N(G)  which 
uses  the  translated  axioms  can  be  translated  back  to  a  derivation 
of  q  from  G.  D 


In  the  propositional  calculus  a  formula  A  is  derivable 
from  a  formula  B  if  and  only  if  the  formula  A  =>  E  is  a 
tautology.  Thus  the  theorem  can  be  stated  as  follows:  "g  9.  G  + 
if  and  only  if  N(G)  =>  N (g)  is  a  tautology”.  It  now  follows  that 
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to  check  if  a  fornula  of  the  form  N  (G)  =>  N(g)  (for  some  G,  q)  is 
a  tautology  one  can  use  Algorithm  4.  One  way  of  doing  it  is  to 
translate  the  problem  back  into  a  membership  problem.  It  is  also 
easy  to  see  that  the  algorithm  can  be  applied  directly  to 
propositional  formulas  of  that  form. 

Let  us  denote  the  class  of  propositinal  formulas  in 
disjunctive  normal  form  in  which  each  disjunct  includes  at  most 
one  negated  literal  by  DNF(l-neg).  He  now  show  that 
tautolcqihood  of  formulas  in  this  class  can  be  checked  in  linear 
time. 

Theorem  7:  The  tautology  problem  for  DNF(l-neg)  can  be  decided 
in  time  linearly  proportional  to  the  size  of  the  input. 

Proof  Let  f  be  a  given  formula  in  DNF(l-neg).  He  first  observe 
that  if  f  contains  a  non-negated  literal  in  each  disjunct  then  f 
is  not  a  tautology;  it  can  be  falsified  by  assigning  the  value 
FALSE  to  all  the  literals,  Therfore  ,  we  assume  that  some 
disjuncts  of  f  do  not  contain  a  non-negated  literal.  These 
disjuncts  must  then  cosist  of  a  single  negated  literal.  He  also 
assume  that  f  contains  at  least  one  disjunct  which  consists  of  a 
single  non-negated  literal.  (If  there  is  no  such  disjunct  ,  we 
can  add  one.  For  if  C  is  a  new  literal  that  does  not  appear  in  f 
then  f  is  a  tautology  if  and  only  if  C  |  f  is  a  tautology.) 

Our  strategy  will  be  to  transform  f  into  an  equivalent 
formula  of  the  form  N (G)  =>  N  (g)  for  suitable  G  and  g.  There  are 
several  cases  to  consider; 

1.  Disjuncts  that  include  both  a  negated  literal  and  a  non- 
negated  literal.  Let  E 1  &. ,  .  8EkR-»D  be  such  a  disjunct.  We 
replace  it  by  the  equivalent  disjunct  (E IF/,  . . &Ek  =>  D) 

2.  Disjuncts  that  include  at  least  two  literals  but  no  non- 
negated  literal.  These  disjuncts  cannot  be  transformed  directly. 
Let  ElS,.,&Ek  be  such  a  disjunct  and  let  D  be  a  new  literal  which 
does  not  appear  in  f.  He  replace  the  disjunct  by  E15.  .  .SEkfr-iD 
and  add  to  f  the  disjunct  consisting  of  the  single  literal  D. 
Clearly,  the  new  formula  thus  obtained  is  a  tautology  if  and  only 
if  f  is  a  tautology.  Now  E1F . . .  6  Ek6-»D  can  be  replaced  as  in  1. 

3.  Disjuncts  consisting  of  a  single  negated  or  non-negated 
literal.  Let  these  disjuncts  be  Cl  |  ,.,  j  Cp  )  -»B1  j  ...  j  -^Bk . 
We  replace  them  by  the  single  equivalent  disjunct  P16,,.6Bk  => 
CIS. . .SCp. 

when  the  above  transformations  are  completed  we  have  a 
formula  of  the  form 

N(g)  I  -N(g1)  I  ...  I  -N(gn) 

for  some  g,g1,,,.,qn.  This  formula  is  equivalent  to  the  formula 
N  (  {g1 ,. . . ,gn} )  =>  N(g).  Now,  the  algorithm  can  be  applied  to 
this  formula.  Since  all  the  transformations  described  above  can 
be  performed  in  linear  time,  the  theorem  follows.  n 
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Me  have  chosen  a  proof  by  way  of  reduction  for  ease  of 
presentation  only.  The  algorithm  can  be  applied  directly  to 
formulas  in  DNF(l-neg),  Some  changes  in  the  terminology  used  in 
the  algorithm  will  be  necessary;  also,  a  consideration  of  the 
possible  types  of  disjuncts  will  have  to  be  incorporated  in  the 
algorithm.  We  leave  it  to  the  reader  to  rewrite  the  algorithm 
for  DNF (1-neg) . 
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II.  4  IMPLEMENTATION  OF  THE  SYNTHESIS  A^ORITHM 
I T  .  4  .  1  Pl§2 1 oc es s  i  ng_ 

In  this  section  we  present  an  implementation  of  the 
synthesis  algorithm.  Algorithm  2  (cf.  section  1.6.4),  using  the 
membership  test  as  a  basic  operation.  While  this  implementation 
seems  to  be  natural  and  quite  efficient,  others  are  conceivable. 
In  any  case,  the  proofs  of  the  properties  of  schemas  synthesized 
by  Algorithm  2  do  not  depend  on  how  it  is  implmented. 

In  the  first  two  steps  of  the  algorithm,  extraneous 
attributes  and  redundant  FDs  are  eliminated.  Each  time  an 
attribute  (or  an  FD)  is  eliminated,  we  obtain  a  new  set  of  FDs. 
When  the  membership  test  is  applied  to  this  new  set,  the 
preprocessing  has  to  be  redone.  However,  the  difference  between 
the  old  set  and  the  new  set  is  quite  small.  It  would  be  much  more 
efficient  if  preprocessing  is  done  only  once  and  each  time  the 
set  of  FDs  is  changed,  only  the  appropriate  changes  in  its 
preprocessed  form  are  performed. 

If,  instead  of  singl y- linked  lists,  we  use  doubl 
lists  to  connect  all  occurrences  of  each  attribute 
eliminating  an  attribute  can  be  done  in  a  constant  n 
steps.  Eliminating  an  FD  is  done  by  eliminating  all  at 
on  its  left  side.  Thus,  the  total  time  spent  in  the  f 
steps  of  the  algorithm  in  eliminating  attributes  and  FDs 
proportional  to  the  number  of  attributes  eliminated 
proportional  to  the  size  of  the  input  in  the  worst  case. 

rising  the  same  (or  almost  the  same)  preprocessed  input 
for  several  membership  tests  means  that  the  counters  will  have  to 
be  reset  for  each  test.  This  also  can  be  done  efficiently,  e.g., 
by  having  two  counters  for  each  FD,  one  of  which  always  contains 
the  original  value.  In  what  follows  we  assume,  therefore,  that 
the  input  has  been  preprocessed  once  and  for  all.  Each  time  an 
attribute  (or  an  FD)  is  eliminated,  the  necessary  local  changes 
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are  performed.  To  apply  the  membership  test  will  mean  to  apply 
the  main  body  of  the  membership  algorithm. 

I I . 4 . 2  I m pie meg tat ion 

Let  f:A1,.../Ak  ->  B  be  an  FD  in  F.  The  attribute  hi  is 
extraneous  in  f  (cf.  section  1.5.1)  if  A1,...,Ai-1,  Ai+1,...,Ak 
->  E  is  in  F-*-.  The  procedure  for  eliminating  extraneous 
attributes  is  shown  in  figure  10. 

Note  that  during  the  execution  of  the  IN  group  the  left 
side  of  g  changes  dynamically.  The  loop  is  performed  exactly 
once  for  each  attriubte  in  the  original  left  side  of  g.  It  is 
quite  clear  that  each  time  an  attriubte  is  eliminated  from  the 
left  side  of  g,  the  closure  of  the  resulting  set  of  FDs  is  the 
same  as  Y*,  (The  FD  g  is  replaced  by  g*  such  that  g'  is  in  G  and 
q  is  derivable  from  q'  by  augmentation,  therefore  G-*-  =  (G  -  (g)  u 
{g*})**)  To  prove  the  correctness  of  the  algorithm  it  suffices  to 
show  that  after  it  terminates,  the  left  side  of  any  g  in  G  does 
net  contain  extraneous  attributes. 

Suppose  the  left  side  of  g  after  termination  is 
A1,...,Ai  and  that  Ai  is  extraneous.  This  means  that  A1,...,Ai- 
1 , Ai+ 1 , . .  .  , Aj  ->  ES  (g)  is  in  G-*-.  But  then,  when  IN  was  executed 
for  Ai,  the  left  side  contained  A 1 , . . . , Ai- 1 , Ai+ 1 , , . . , A j  so  Ai 
should  have  been  eliminated  --  a  contradi tion . 

We  use  a  similar  procedure  to  implement  the  second  step 
of  Algorithm  2  —  elimination  of  redundant  FDs  (see  figure  11). 
Here  again  the  membership  test  is  performed  once  for  each  FD  in 
the  original  set  G  and  H  changes  dynamically.  It  is  obvious  that 
the  closure  remains  the  same  throughout  the  execution  of  the 
procedure.  Also,  after  termination  of  the  procedure  no  h  €  H  is 
redundant.  If  h  €  (H  -  {h} ) t  then  for  any  set  H*  containing  H,  k 
e  (H'  -  {h})t,  so  h  should  have  been  eliminated  when  it  was 
tested  for  redundancy. 

The  implementation  of  step  3  (partition)  is 
straightforward.  ?!!  left  sides  are  arranged  in  a  sequence  and 
each  one  is  compared  to  the  left  sides  preceding  it  in  the 
sequence.  However,  a  more  efficient  implementation  exists.  It 
will  be  described  in  the  next  section.  For  step  4  (merge 
equivalent  Iceys)  ,  we  again  use  the  membership  test  to  check  if 
two  given  left  sides  are  equivalent.  For  step  5  we  can  use  the 
same  procedure  that  was  used  for  step  2.  We  first  order  the  set 
H+J  such  that  all  elements  of  H  come  first  and  we  perform  the 
redundancy  check  only  for  elements  of  H.  Implementation  of  step 
6  is  also  straightforward. 
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Figure  10 

I  ffiplement ation  of  Ste£  _1  of  Algorithm  4 
Elimi^^ati nq  Extraneous  Attributes 

input  =  F; 

G  =  F; 

OUT:  do  for  each  g  e  G; 

IN:  do  for  each  attribute  A  in  LS  (g)  ; 

if  (LS(g)  -  {A})  ->  RS(g)  is  in  G-^ 

then  eliminate  A  from  LS  (g)  ; 

end  IN; 

end  OUT; 

output  G; 


Figure  11 

I roglemen ta tion  of  ste£  2  of  algorithm  4 
Finding  a  Nonredundant  Covering 

input  =  G 
H  =  G; 

do  for  each  heH; 

if  he  (H  -  {h}  )  + 
then  H  =  H  -  {h}  ; 

end ; 


output  H ; 
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such  that  all  sets  in  any  class  are  equal,  sets  in  diffe 
classes  are  not  equal  and  every  set  belongs  to  one  class. 

We  first  construct  an  n  by  m  matrix  where  the  ( 
element  of  the  matrix  is  1  if  Aj  €  Si  and  is  0  if  Aj  -•€  Sj. 
construction  can  be  done  in  0  (ran)  steps.  Then  we  put  all 
sets  S1,..,,Sn  into  one  class  which  is  represented  as  a  1 
Now,  we  enter  the  partitioning  process. 
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Having  at  most  n  groups  generated  in  step  3,  step  4 

consists  of  at  most  0  (n)  membership  tests,  so  it  takes  tirn.p 

0(n|F|).  In  step  5  we  perform  at  roost  n  membership  tests,  since 
only  elements  of  H  are  tested.  However,  the  test  is  based  on  the 
set  H+J.  Thus,  to  estimate  the  time  spent  in  a  membership  test, 
we  need  an  estimate  of  the  size  of  J. 

The  size  of  J  depends  on  the  way  J  is  constructed  in 
step  4.  Let  be  an  equivalence  class,  where  the 

elements  of  the  class  are  listed  in  the  order  in  which  they  have 
been  added  to  the  class.  Let  us  assume  that  each  time  we  want  to 
check  if  a  given  left  side  belongs  to  the  class,  we  compare  it  to 

the  last  element  added  to  the  class.  Then  the  FDs  in  J  that 

correspond  to  this  class  are  X1<->X2,  X2<->X3,....  Any  left  side 
appears  in  at  most  two  equivalences  in  J  (e.g,,  X2  appears  in 
X1<“>X2  and  in  X2<”>X3) .  An  equivalence  X<->Y  can  be 
represented  by  the  FDs  X->Y  and  Y->X.  Each  left  side  of  an  FD  in 
H  appears  in  at  most  four  such  FDs  and  it  follows  that  |J|  = 
0  (|H|)  =  0(|Fj)  . 

Step  5  takes,  then,  0(n|F|) .  Step  6  is  free,  if 
appropriate  operations  are  performed  in  the  previous  steps, 
i.e.,  equivalence  classes  can  be  represented  as  relations  when 
step  4  is  performed  and  redundant  attributes  in  these  relations 
are  excised  in  step  5.  Thus,  the  total  time  spent  in  the 
algorithm  is 

O(lFl)  +  0(|FJ2)  +  0{n|F|)  +  0(nm)  ♦  0(nlF|)  +  0(n|F|) 

=  0  (max(| F\ ?,  n| F| ,  nm) ) 

=  0(1F12)  . 
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part  III  THE  BCNF  AND  THE  FINDING  PPOBLEHS 


I I I • 1  INTRODUCTION 

111. 1.1  A  Survey  of  the  Results 

In  the  first  two  parts  of  this  work  we  have  exhibited 
the  feasibility  of  the  algorithmic  approach  to  the  problem  of 
synthesizing  3NF  schemas.  We  turn  now  to  two  related  problems  - 
Boyce  Codd  normal  form  and  the  existence  of  additional  keys.  In 
both  cases  ,  our  results  strongly  suggest  that  efficient 
algorithms  for  the  treatment  of  these  problems  do  not  exist. 

A  new  normal  form,  called  Boyce-Codd  normal  form  (abbr. 
BCNF),  which  is  strictly  stronger  then  3NF  was  presented  in  [8]. 
We  have  seen  that  any  set  of  FDs  can  be  represented  by  a  3NF 
schema  and  there  exists  an  algorithm  that  produces  such  a  schema 
for  any  given  set.  We  would  like  to  know  if  similiar  results 
hold  for  BCNF.  That  is,  can  any  set  of  FDs  be  represented  by  a 
BCNF  schema  and  can  the  synthesis  algorithm  or  an  extension 
thereof  be  used  to  produce  BCNF  schemas?  These  problems  are 
treated  in  section  2.  We  show  that  BCNF  violations  are  inherent 
in  some  sets  of  FDs;  for  these  sets  no  BCNF  schema  exists.  We 
also  show  that  even  when  a  BCNF  schema  exists  for  a  given  set  of 
FDs,  the  algorithm  may  produce  a  schema  which  is  not  in  BCNF, 
Finally,  we  prove  that  the  problem  whether  there  is  a  DCNF 
violation  in  a  given  relation  (where  a  set  of  FDs  is  also  given) 
is  NP-complete.  This  is  true  even  when  it  is  known  that  the 
relation  was  produced  by  the  synthesis  algorithm  from  the  given 
set  of  FDs.  These  results  imply  that  extending  the  synthesis 
algorithm  to  produce  a  BCNF  schema,  even  when  such  a  schema  does 
exist,  is  probably  not  computationally  feasible. 

A  relation  in  a  schema  contains  one  or  more  designated 
keys.  These  keys  may  be  specified  by  the  user  or  produced  by  the 
synthesis  algorithm.  In  addition  to  these  keys,  other  keys  may 
exist  in  a  relation  by  virtue  of  the  given  set  of  FDs.  In 
section  3  we  treat  problems  relating  to  such  additional  keys.  We 
show  that,  given  a  set  of  FDs  and  a  relation  with  some  designated 
keys,  there  may  exist  in  the  relation  additional  keys.  Again, 
this  is  true  even  when  the  relation  and  its  keys  are  produced  by 
the  synthesis  algorithm  from  the  given  FDs.  We  also  prove  that 
the  problem  of  deciding  whether  an  additional  key  exists  in  a 
given  relation  is  NP-complete.  As  in  the  case  of  the  BCNF 
violation,  these  results  seem  to  imply  that  key  finding  is  an 
inherently  difficult  problem.  Finally,  some  results  by  Lucchesi 
and  Osborne  [1h],  based  on  a  restricted  form  of  our  model,  are 
compared  to  the  other  results  in  the  section. 

111. 1 . 2  A  Review  of  Definitions 

In  order  to  give  an  adequate  and  precise  treatment  of 
the  problems  mentioned  above  we  will  review  here  some  of  the 
definitions  from  part  I  (see  section  1.2). 
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A  relational  schema  S  consists  of  a  finite  set  of 

relation  names;  for  each  relation  is  given  the  set  of  attributes 
that  appear  in  it  and  one  or  more  subsets  of  this  set  called 
•keys’.  (He  will  explain  later  why  the  word  key  appears  ir 
qoutes.)  Given  a  relation  B,  we  say  that  an  FD:X->A  is  embodied 
in  F  if  X  is  a  given  ’key*  of  R  and  A  is  any  attribute  of  R.  The 

set  of  FDs  that  are  embodied  in  the  schema  S  is  the  collection  of 

all  FDS  which  are  embodied  in  relations  of  S.  A  schema  S 

rgpresent s  a  set  F  of  FDs  if  the  closure  of  the  set  of  FDs 
embodied  by  S  is  equal  to  (cf ,  section  1.5,2). 

Given  a  relation  E  and  a  set  of  FDs  F,  a  subset  of  the 
attributes  of  R  is  a  sujperkey  of  R  if  for  any  attribute  A  in  R, 
the  FD  K->A  is  in  F-*-,  A  stipe rkey  K  is  a  ke^  if  it  does  not 
contain  any  proper  subset  that  is  also  a  superkey. 

If  the  relation  R  is  part  of  a  schema  that  represents 
the  set  of  FDs  F,  then  all  given  'key’s  of  R  are  superkeys  by 
definition.  But  they  are  not  necessarily  keys;  they  may  contain 
extraneous  attributes.  By  the  results  of  part  I,  in  any  relation 
generated  by  the  synthesis  algorithm  all  'key’s  are  actually 
keys,  Furthemore,  we  note  that  the  first  step  of  the  synthesis 
algorithm  can  be  used  as  a  procedure  for  eliminating  extraneous 
attributes  from  superkeys  and  it  can  be  applied  to  any  relation. 
Therefore,  we  will  assume  that  any  'key's  given  in  a  relation  are 
actually  keys,  that  is,  that  extraneous  attributes  have  beer- 
eliminated  . 

For  our  treatment  of  the  problems  in  this  part  we  do  not 
require  that  a  schema  be  given.  If  a  set  of  FDs  is  given  then  it 
is  possible  to  ask  about  a  single  relation  if  it  is  in  BCNF. 
Similiarly,  it  is  possible  to  ask  if  it  contains  additional  keys. 
To  summarize,  we  assume  that  the  following  is  given: 

1.  A  set  F  of  FDs. 

2,  A  relation  R  with  some  'key's  such  that 

a.  the  set  of  FDs  embodied  in  R  is  a  subset  of  F+,  and 

b.  all  'key's  given  in  E  are  keys, 

(Because  of  2b.  we  will,  from  now  on,  omit  the  quotes  from  the 
word  key.)  In  some  cases  we  will  construct  a  schema  from  the 
given  set  of  FDs,  This  is  done  only  to  show  that  the 
corresponding  result  holds  even  if  it  is  known  that  the  relation 
and  its  keys  have  been  produced  by  the  synthesis  algorithm.  No 
use  is  made  of  the  other  relations  in  the  schema, 

III. 2  EOYCE-CODD  NORMAL  FORM 
III. 2.1  Updates  in  BCNF  Relations 

Third  normal  form  was  introduced  to  solve  certain  kinds 
of  update  anomalies  and  consistency  difficulties  among  nonprime 


52 


attributes  in  a  relation 
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E  and  a  set  of  attributes 
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(cf.  section  1.2.5),  However,  3NF  does 
ms  among  prime  attributes.  Boyce-Codd 
ed  as  a  strictly  stronger  formulation  of 
to  cover  prime  attributes  as  well, 
among  attributes  in  a  Poyce-Codd  normal 
he  form  "key  functionally  determines 
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n  holds:  If  there  is  an  attribute  A.  in 
y  in  R  with  A  not  in  X  and  X  ->  A,  then 
unctionally  dependent  upon  X. 


It  is  easily  seen  that  every  BCNF  relation  is  also  in 
3NF.  For  if  X  ->  Y,  Y  />  X,  Y  ->  A  were  a  transitive  dependency 
in  a  BCNF  relation,  then  Y  ->  A  and  Y  />  X  would  be  a  violation 
of  BCNF,  On  the  other  hand,  not  every  3NF  relation  is  in  BCNF. 

An  example  of  a  3NF  relation  that  is  not  in  BCNF  is 
POSTAL_DISTPICT  (CITY,  ADDRESS,  POSTALCODE)  where  CITY, ADDRESS  -> 
POSTALCODE  and  POSTALCODE  ->  CITY,  The  only  two  keys  of 
POSTAL^DISTRICT  are  CITY, ADDRESS  and  POSTALCODE, AD DPES S .  The 
relation  POSTAL_DISTRICT  is  trivially  in  3NF,  since  it  has  no 
ncnprime  attributes.  Yet  ?OSTAL_DISTRICT  is  not  in  BCNF,  since 
POSTAL_CODE  ->  CITY  but  POSTAL_CODE  />  ADDRESS.  Notice  also  that 
ADDRESS  />  CITY,  since  an  address  (e.g.,  10  Elm  Street)  could 
appear  in  more  than  one  city. 


In  non-BCNF  rela 
the  same  as  those  caused 
attributes  on  keys.  The 
a  non-BCNF  relation  ca 
example,  consider  the 
POSTALCODE  ->  CITY  in  the 
arbitrarily  change  the 
ADDRESS,  because  by  doing 
CITY,  This  is  essentia 
violations.  Also,  the 
combination  for  a  partic 
->  CITY  connection.  Thus 
here  as  well. 


tions,  problems  ari 
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One  can  look  at  BCNF  as  an  attempt  at  making  tuple 
updates  completely  independent.  That  is,  since  each  tuple  in  a 
relation  normally  represents  an  object  or  relationship  in  the 
world  (e.g,,  see  [16]),  one  would  expect  to  be  able  to  update  any 
one  tuple  in  a  relation  without  regard  to  any  other  in  that 
relation.  The  above  example  shows  that  this  is  not  always 
possible  in  a  3NF  relation.  However,  as  we  will  now  explain,  it 
is  always  possible  in  a  BCNF  schema. 

Suppose  we,  want  to  change  the  values  of  some  attributes 
in  one  tuple  of  a  relation.  Other  tuples  may  be  influenced  by 
this  update  only  if  the  following  two  conditions  are  met.  The 
first  condition  is  that  a  given  combination  of  values  for  these 
attributes  may  appear  in  different  tuples  in  the  relation.  (This 
is  equivalent  to  saying  that  the  set  of  attributes  whose  values 
we  want  to  change  is  not  a  key  of  the  relation.)  Clearly,  only 
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tuples  in  which  these  attributes  have  the  same  values  as  in  the 
*new*  tuple  can  be  influenced  by  the  update.  The  second 
condition  is  that  this  set  of  attributes  determines  another 
attribute  in  the  relation  (because  then  the  update  may  violate 
this  dependency) .  Now,  in  a  ECNF  relation,  no  set  of  attributes 
can  satisfy  both  conditions,  A  set  is  either  a  key  or  it  does 
not  determine  any  other  attribute.  Therefore,  tuple  updates  in  a 
BCNF  relation  are  independent. 

1 1 1. 2. 2  Some  Negative  Results 

In  the  first  part  of  this  report,  we  presented  a  fast 
algorithm  for  synthesizing  a  3NF  schema  from  a  given  set  of  FDs. 
In  this  section  we  will  show  why  any  similar  approach  to  PCN F  is 
very  likely  to  fail. 

The  main  goal,  then,  is  to  find  a  BCNF  relational  schema 
that  represents  a  given  set  of  FDs.  However,  this  goal  is 
impossible  to  fulfill,  since  there  are  sets  of  FDs  that  cannot  be 
represented  by  any  BCNF  schema. 

Lemma  6:  There  is  a  set  of  FDs  that  cannot  be  represented  by  any 
BCNF  relational  schema. 

Proof  Let  F  =  (AB  ->  C,  C  ->  A}  be  a  set  of  FDs.  (These  are 
exactly  the  FDs  in  POSTAL DISTEICT ,  with  A  =  CITY,  B  =  ADDRESS, 
and  C  -  POSTALCODE.)  By  a  brute  force  examination  of  F* ,  it  can 
be  shown  that  there  are  no  two  FDs  f  and  g  in  F+  that  range  over 
a  nroper  subset  of  {A,B,C}  with  f»g:AB  ->  C,  Thus,  the  only 
relation  that  can  embody  AE  ->  C  must  be  one  that  contains  A,  E, 
and  C  as  attributes.  Eut  such  a  relation  is  not  in  BCNF,  since  C 
->  A  but  C  />  B.  Hence,  any  schema  that  embodies  F  violates 
BCNF.  n 


It  has  been  pointed  out  that  a  BCNF  schema  can  be  formed 
from  which  AB  ->  C  and  C  ->  A  can  be  extracted  [9],  However,  by 
lemma  6  any  such  method  of  extracting  these  FDs  from  the  schema 
must  involve  information  in  addition  to  the  knowledge  of  embodied 
FDs  (e.g.,  that  a  particular  join  yields  AE  ->  C  as  a  result). 
That  is,  such  a  schema  does  not  embody  AB  ->  C  and  C  ->  A,  in  our 
formal  sense.  Yet  we  know  of  no  published  relational  system  that 
allows  such  additional  information  to  be  represented  in  the  data 
definition  language.  Only  the  keys  are  known.  Hence,  our  formal 
definition  of  embodiment  closely  models  what  is  actually  feasible 
in  present-day  relational  systems.  This  means  that  some  FDs 
always  lead  to  BCNF  violations  and  require  a  special  mechanism  ^o 
solve  the  integrity  problems  induced  by  such  violations. 

The  impact  of  lemma  6  is  softened  somewhat  by  the 
observation  that  sets  of  FDs  which  cannot  be  represented  by  BCNF 
schemas  seem  to  be  quite  rare  in  modelling  real  world  situations. 
What  we  would  like  to  know  is:  In  synthesizing  a  schema  from  a 
given  set  of  FDs  F,  can  we  at  least  guarantee  that  if  a  BCNF 
schema  that  embodies  a  covering  of  F  is  possible,  then  our 
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Figure  12 

Two  Coverings,  Only  One  of  which  Violates  BCNF 


FDs 

A  ->  B,C 
B,C  ->  A 
A,D  ->  E 
F  ->  C 


Relations 
R1  (A,BxC) 

P2  (A^D,  E) 
R3  (E,C) 


FDs 

A  ->B,C 
P,C  ->  A 


B,C, D  ->  E 


F  ->  C 


(b) 


Relations 

51  (A,Bj^C) 

52  (B^Cj^D,E) 
S3 (E,C) 


In  both  cases  the  given  sets  of  FDs  are  nonredundant .  Also,  they 
have  the  same  closure.  Yet,  in  the  first  case  the  synthesized 
schema  is  in  BCNF,  in  the  second  case  it  is  not. 
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synthesis  algorithm  will  find  it?  That  is,  can  be  be  sure  to 
obtain  a  RCNF  schema  from  algorithm  2  when  such  a  schema  is 
possible?  Unfortunately,  the  answer  is  no.  In  figure  12,  we 
present  two  nonredundant  covering  of  a  given  set  of  FDs,  where 
one  covering  results  in  a  BCNF  schema  and  the  other  does  not. 
This  shows  that  BCNF  is  not  an  invariant  property  of  coverings; 
if  step  2  of  Algorithm  2  chooses  the  wrong  covering,  the  result 
will  violate  BCNF. 
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they  could  have  been  avoided .  Now,  we  are  faced  with  the  proble 
of  examining  each  relation  of  the  schema  to  check  whether  it  i 
in  BCNF. 


Checking  whether  a  relation  is  in  BCNF  is  clearly 
decidable.  The  membership  algorithm  can  be  used  to  check  every 
subset  of  attributes  in  the  relation  to  see  if  it  functionally 
determines  some  but  not  all  attributes  in  the  relation.  This 
algorithm,  though,  is  very  slow,  since  it  requires  checking  an 
exponential  number  of  sets  of  attributes.  That  a  faster 
algorithm  is  not  very  likely  to  be  found  follows  from  theorem  8, 
which  shows  the  BCNF  detection  problem  to  be  NP-coraplete. 


Theorem  8:  Given  a  set  of  attributes  X,  a  set  F  of  FDs  over  X, 
and  a  3NF  relation  R  over  a  subset  of  X,  the  question  '‘does  R 
violate  RCNF”  is  NP-complete.  The  problem  is  NP-comolete  ever, 
when  it  is  known  that  F:  is  one  of  the  relations  produced  by  ♦:he 
synthesis  algorithm  from  the  set  F. 


£^22^  problem  is  NP  computable  by  nonde  ter  mini  sticall  y 

choosing  a  subset  of  the  attributes  of  P.  and  verifying  that  this 
subset  functionally  determines  some  but  not  all  attributes 
(i.e.,  verifying  that  R  violated  BCNF).  To  show  the  problem 
NP-dif f icult ,  we  reduce  the  hitting  set  problem  [13]  to  the  BCNF 
violation  problem. 


of  E 
is 


The  hitting  set  problem  is  formulated  as  follows:  we 
are  given  a  family  {Vi}  i=1,...,n  of  subsets  of  T  =  {t1,...,tr}. 
We  have  to  decide  if  there  exists  a  set  W  c  T  such  that  for  each 
1<i<n  the  intersection  of  W  with  Vi  contains  exactly  one  element. 
The  problem  was  proved  to  be  NP-complete  in  [13]. 


We  now  show  how  to  construct,  in  polynomial  time,  f or  a 
given  instance  of  the  hitting  set  problem  a  corresponding 
instance  of  the  BCNF  violation  problem,  such  that  the  two 
instances  have  either  both  positive  solutions  or  both  negative 
solutions . 
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We  construct  the  following  set  F  of  FDs: 

1.  For  each  i,  for  each  pair  u1,  u2  of  elements  in  Ui  and  for 
each  t  €  T,  the  set  F  contains  the  FD  u1,u2  ->  t, 

2.  For  each  i,  let  xi  be  a  new  object*  The  set  F  contains  the 
FD  x1,...,xn  ->  t1, 

3.  For  each  u  e  Ui,  for  each  i,  F  contains  the  FD  u  ->  x. 

Applying  the  synthesis  algorithm  to  the  set  F,  we  obtain  the 
following  relations.  From  (1) ,  we  obtain  a  relation  B1  that 
contains  exactly  the  elements  of  T  and  in  which  the  pairs  of 
elements  of  any  Ui  are  the  synthesized  keys.  From  (2),  we  obtain 
a  relation  E2  that  contains  x1,,..,xn  and  t1  and  its  synthesized 
key  is  x1,...,xn.  From  (3)  we  obtain  relatons  R3,.,,,Er+2,  such 
that  in  Rj+2  the  key  is  tj  and  it  contains  all  xi  such  that  tj  e 
Ui. 

We  prove  now  that  R1  has  a  BCNF  violation  if  and  only  if 
the  hitting  set  problem  has  a  positive  solution.  First,  suppose 
W  is  a  subset  of  T  that  intersects  each  Ui  exactly  once.  Then  W 
->  x1,...,xn  follows  from  the  FDs  in  (3),  so  applying  the  FD  in 
(2)  we  obtain  W  ->  t1  in  the  closure.  Flowever,  W  does  not 
contain  a  key  of  FI,  since  the  only  FDs  that  have  t2,...,tn  on 
the  right  hand  side  are  those  of  (1)  and  W  does  not  contain  two 
elements  of  any  Ui.  Conversely,  suppose  R1  has  a  BCNF  violation. 
Let  Y  c  T  be  a  subset  of  attributes  of  P.  that  derives  some  tk  e  T 
but  not  tj  e  T.  Clearly,  Y  cannot  contain  any  pair  of  elements 
of  any  Ui.  The  only  way  then  to  derive  any  element  of  T  is  by 
using  x1,...,xn  ->  t1,  which  means  that  Y  contains  one  element  of 
atleasteachUiandisahit ting  set. 

Finally,  we  observe  that  the  number  of  pairs  of  elements 
in  T  is  r2,  so  the  number  of  FDs  we  construct  in  (1)  is  at  most 
r^.  It  follows  easily  that  the  instance  of  the  BCNF  violation 
problem  is  constructed  in  time  proportional  to  a  polynomial 
function  of  the  size  of  the  instance  of  the  hitting  set  problem. 
The  theorem  follows.  n 


To  summarize  the  above  results,  we  have: 

1.  Not  every  set  of  FDs  can  be  represented  by  a  BCNF 
schema, 

2.  Algorithm  2  does  not  necessarily  synthesize  a  BCNF 
schema,  even  when  such  a  schema  is  possible,  and 

3.  the  problem  of  determining  whether  a  relation  violates 
BCNF  is  NP-complete. 

Together,  these  results  strongly  suggest  that  FDs  are  too  strong 
a  model  to  obtain  BCNF  schemas  algorithmically.  Perhaps  a  better 
approach  is  to  develop  a  weaker  model  of  real  world 
relationships,  a  model  whose  additional  structure  (and  weaker 
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modelling  power)  makes  the  detection  of  BCNF  violations  an  easier 
problem.  For  any  model  that  is  strong  enough  to  model  FDs  must 
manifest  the  above  problems. 


I I I . 3  KEY  finding 

In  this  section  we  treat  the  problem  of  the  possible 
existence  of  unknown  keys.  If  the  keys  synthesized  by  Algorithm 
2  (i.e.,  the  "synthesized  keys”)  for  a  particular  relation  always 
included  all  of  the  keys  of  the  relation,  then  there  would  be  no 
problem.  However,  a  relation  can  have  keys  that  are  not 
synthesized  by  Algorithm  2.  For  example,  given  the  set  of  FDs 
{AB->C,  C->B} ,  the  3NF  schema  constructed  by  the  synthesis 
algorithm  contains  the  relations  P1(A_tB,C)  and  P2(C,B).  Clearly, 
AC  is  an  additional  key  of  FI,  although  it  was  not  synthesized. 
Given  that  these  additional  keys  exist,  the  question  we  would 
like  to  examine  is:  How  difficult  is  it  to  find  these  keys? 

One  approach  to  finding  the  keys  of  a  relation  is  to 
check  all  subsets  of  the  attributes  in  the  relation  starting, 
say,  with  subsets  of  one  element,  then  subsets  of  two  elements, 
etc.  Since  the  number  of  such  subsets  grows  very  quickly  with 
the  size  of  the  relation,  it  would  be  helpful  to  discover  a 
condition  that  will  tell  us  that  no  more  subsets  have  to  be 
checked,  since  all  of  the  keys  have  already  been  found.  One  such 
condition  might  be  that  if  all  known  keys  have  cardinality  less 
than  some  integer  n,  and  there  are  no  keys  of  cardinality  n,  then 
there  are  no  more  keys  to  check.  This  condition  would  allow  us 
to  stop  building  up  subsets  when  all  the  subsets  of  a  particular 
cardinality  turn  out  to  yield  no  new  keys.  However,  this 
condition  fails  on  the  example  in  figure  13,  In  this  example, 
the  attributes  X1,X2,X3,X4  together  constitute  a  key  of  F1,  ye+ 
this  key  is  not  synthesized  by  the  synthesis  algorithm.  Even 
though  there  are  no  additional  keys  of  cardinality  less  than  four 
(and  there  are  no  keys  at  all  of  cardinality  3),  this  additional 
key  exists.  It  is  also  easy  to  generalize  the  example  to  an 
arbitrarily  large  cardinality  gap  (i.e.,  no  keys  of  cardinality 
greater  than  two  and  less  than  n,  for  arbitrary  n,  yet  one  key  of 
cardinality  n+1).  This  cardinality  condition,  and  others  like 
it,  fail  for  a  fundamental  reason. 

Let  the  additional  key  problem  be  defined  as  follows: 
Given  a  relation  F  with  a  set  of  keys  and  a  set  of  FDs 
(satisfying  the  conditions  1,  2a,  2b  of  section  III.  1.2),  does  P 
contain  an  additional  key? 

Theorem  9:  The  additional  key  problem  is  NP-complete. 
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Figure  13 

A  Cardinality  Gap  for  Keys 

K  ->  XI,  X2,  X3,  X4 
LI,  L2  ->  K 
XI  “>  *^1,  P!5 
X2  ->  M2,  K6 
X3  ->  M3,  M7 
X4  ->  M4,  MR 
Ml, M2, M3, M4  ->  Li 
M5,M6,M7,M8  ->  L2 

(i)  a  given  set  of  FDs 

(K#  Itlilil,  XI,  X2,  X3,  X4) 

82  (XI.,  Ml,  M5) 

R3(X2,  M2,  M6) 

P4(X3,  M3,  M7) 

R5  (X4,  M4,  MB) 

B 6  { 2 X.M  3 ^ M  4  ,  LI) 

R7  (M5.LMx.il2xMr  ^2) 

(ii)  The  relations  synthesized  from  the  above  FDs. 

In  the  relation  R1,  X1,X2,X3,X4  is  a  non-syn thesized 
key,  even  though  there  are  no  keys  of  cardinality  three. 
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Lucchesi  and  Osborne  [14]  have  treated  some  rel 
problems  in  a  somewhat  restricted  model.  In  their  model,  one 
given  a  set  of  attributes  and  a  set  of  FDs  on  them;  they  def 
a  key  as  a  minimal  subset  that  derives  all  other  attributes, 
essentially  means  that  they  treat  all  attributes  of  the  en 
schema  as  being  collected  into  a  single  relation.  In 
approach  attributes  are  collected  in  several  relations,  A  ke 
a  relation  derives  all  attributes  in  that  relation.  It 
generally  accepted  that  relational  schemas  consist  of 
relations,  as  evidenced  by  all  of  the  relational  systems  buil 
date.  The  division  of  the  set  of  attributes  into  relation 
necessary  for  several  reasons:  normalization  considerations, 
of  use,  size  of  relations,  etc.  He  will  now  consider 
difference  in  approaches  by  comparing  results  in  the  two  mod 
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Since  their  one-relation  model  is  a  special  case  of  the  multiple- 
relation  model ,  these  problem s  are  also  NP-complete  in  the  latter 
model.  However,  it  should  be  noted  that  these  problems  are  NP- 
ccmplete  even  if  we  restrict  ourselves  only  to  mult iple- rela tion 
schemas  synthesized  by  Algorithm  2.  This  is  not  a  direct 
consequence  of  their  results.  To  see  this,  let  us  consider  ♦•he 
prime  attribute  problem  for  synthesized  schemas. 

Suppose  we  are  given  a  set  of  FDs  G  over  the  attributes 
Al,...,Am.  Let  R(Al,...,An)  where  n<m  be  one  of  the  relations 
synthesized  from  G  by  the  synthesis  algorithm.  We  want  to  know 
if  A1  is  prime  in  F,  We  map  this  problem  to  a  similiar  problem 
for  the  one- relationa 1  model.  To  do  this,  we  add  a  new  attribute 
D  and  we  add  the  FD  D  ->  Al,...,Am  and  the  FD  A1,...,An  ->  D  to 
G,  If  we  apply  the  synthesis  algorithm  to  the  new  set  of  FDs, 
there  will  be  one  additional  relation  containing  the  attributes 
D,A1,...,Am.  The  keys  for  this  relation  will  be  D  and  any  key  of 
R.  Thus,  A1  is  prime  in  B  iff  it  is  prime  in  this  relation.  out 
this  relation  contains  all  the  attributes  in  the  schema  so ,  by 
the  Lucchesi  and  Osborne  result,  the  problem  is  NP-complete.  A 
similiar  reduction  can  be  used  for  the  key  of  cardinality  m 
problem. 


The  results  in  this  section  strongly  sugaest  t ha^-  key 
finding  is  an  inherently  difficult  problem.  From  theorem  9  it 
follows  that  if  NP  -•=  P  then  there  is  no  algorithm  that  lists  all 
keys  in  time  polynomial  in  the  size  of  the  relation  and  the  set 
of  FDs,  It  is  true  that  even  if  NP  -•=  P  none  of  the  results 
implies  that  an  algorithm  that  lists  all  keys  of  a  relation  in 
time  polynomial  in  the  number  of  keys  does  not  exist.  It  the 
difficulty  of  the  additional  key  problem  lies  in  the  cases  where 
the  number  of  additional  keys  is  exponential  in  the  size  of  the 
relation.  However,  we  conjecture  that  such  an  algorithm  does  not 
exist . 
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CONCLUSION 
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Our  results  about  the  Eoyce-Codd  normal  form  and  about 
the  key  finding  problems  have  been  all  negative.  We  have  shown 
that  the  problems  of  whether  a  schema  is  in  Boyce-Codd  normal 
form  and  of  whether  a  relation  contains  additional  keys  are  NP- 
comolete.  On  the  basis  of  these  and  other  results  we  argued 
(though  we  could  not  prove)  that  an  efficient  algorithm  that 
produces  Eoyce-Codd  normal  form  schemas  and  an  efficient 
algorithm  that  lists  all  keys  of  a  relation  do  not  exist. 

In  view  of  the  synthesis  algorithm,  the  concept  of 
functional  dependency  has  proven  to  be  a  useful  tool  for  the 
construction  of  relational  schemas.  Recently,  generalizations  of 
this  concept  have  been  suggested.  We  believe  that  functional 
dependencies  (and  their  generalizations)  may  prove  to  be  useful 
in  the  treatment  of  other  problems  of  relational  data  base 
systems. 
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